Should you choose to do a PhD?


Is it your cup of tea?


  1. Properties of a Ph.D.
    1. Personal Growth
    2. Expertise
    3. Ownership
    4. Exclusivity
    5. Status
    6. Personal Freedom
    7. Maximizing Future Choice
    8. Maximizing Variance
    9. Freedom
    10. But wait, there is more…
  2. So, you have made the choice. What next?
    1. Getting into a Ph.D. program: (1) references, (2) references, (3) references.
    2. Picking the school

Preamble: I explicitly state that this post is not intended to convince anyone to get a Ph.D. I try to enumerate some common considerations below.

First, should you want to get a Ph.D.? I was in the fortunate position of knowing from the moment I started my M.Sc. research that I really wanted a Ph.D. Unfortunately, this was not based on well-gauged and pondered considerations. First, I really liked grad school and learning things, and I wanted to learn as much as possible. Second, I met many an industry Ph.D. who thrived in an industry that is now riding into the sunset. Third, in academia I was surrounded by Ph.D.s, and it is taken for granted that you will get a Ph.D. If you are more thorough in making your life’s decisions, what should YOU do?

Many books exist and many a Ph.D. graduate has written about the topic. For example, some of the points below were taken from Justin Johnson’s post on Quora: I got a job offer from Google, FB, and MS and also got accepted into the Ph.D.… Another great Quora thread is: Quora: What is the purpose of doing a Ph.D.. And of course there are many, many subreddits, such as r/PhD, and many specific career subreddits that discuss the topic daily.

If you are not opting for a Ph.D., the main alternative to consider is joining a medium-to-large company.

Ask yourself if you find the following attributes attractive:

Properties of a Ph.D.

Personal Growth

A Ph.D. is an intense experience of rapid growth: you soak up knowledge like a sponge. It is a time of personal self-discovery during which you will become a master of managing your own psychology. Many Ph.D. programs offer a high density of exceptionally bright people who have the potential to become your Best Friends Forever.

Expertise

A Ph.D. is probably your only opportunity in life to really delve deep into a topic and become a recognized leading authority on something within the scientific discourse. You are exploring the edge of humankind’s knowledge, without the burden of lesser distractions or constraints. There is something poetic about that, and if you disagree, it could be a sign that a Ph.D. is not for you.

Ownership

The research you produce will be yours as an individual, with papers listing you as first author. Your accomplishments will have your name stamped on them. In a larger company, by contrast, your work is much more likely to disappear; a common sentiment is becoming an insignificant member of an anonymous apparatus, a “cog in the machine”.

Exclusivity

Few people make it into top Ph.D. programs. You join a group of a few hundred distinguished individuals, contrasted with the tens of thousands who join private industry without a Ph.D.

Status

While many people in industry tend to belittle Ph.D.s (e.g., “You will work on the matter of a whole Ph.D. every three months in here…”), working towards and eventually getting a Ph.D. degree is societally honored and recognized as an impressive achievement. You also get to be a Doctor; that is great.

Personal Freedom

As a Ph.D. student/candidate you will be your own leader. Bad night and slept in? That works. Skip a day for a vacation? No problem. It is like those companies that give limitless days off. All that matters is your result, and no one will force you to clock in from 09:00 to 17:00. Some profs (i.e., advisers) might be more or less flexible about this, and some companies might be as well.

Maximizing Future Choice

For cutting-edge subjects that are just making it out of academia into industry, there is no penalty for obtaining a Ph.D., since you know the latest techniques or even created them. Employers pay big bucks for your talent and brains to invent the next quantum computer etc. For more mature subjects where less revolutionary research happens, lower degrees may be preferred. So if you join a Ph.D. program in a cutting-edge subject, it does not close any doors or eliminate future employment/lifestyle options. You can go from a Ph.D. to anywhere, but not the other way around, from anywhere to Ph.D./academia/research. It is also less likely that you will get an academic job without a Ph.D. Additionally (although this might be quite specific to applied subjects), you are strictly more hirable as a Ph.D. graduate or even as an A.B.D., and many companies might be willing to put you in a more interesting position or offer a higher starting salary. More generally, maximizing the choices for your future is a good path to follow.

Maximizing Variance

You are young and there is really no need to rush. Once you graduate from a Ph.D. you can spend the next tens of years of your life in some company. Choose more variance in your experiences; it will help you avoid getting pigeon-holed.

Freedom

A Ph.D. will offer you a lot of freedom in the topics you wish to pursue and learn about. You are in charge. Of course, you will have an adviser who will guide you in a certain direction, but in general you will have far more freedom working on your degree than you might find in industry.



Photo by @emilymorter on Unsplash

But wait, there is more…

While the points above sound neutral to positive, it cannot go unsaid that the Ph.D. is a very narrow and specific kind of experience that deserves a large disclaimer:

You will inevitably find yourself working very hard for qualification exams, before conference talks, or toward submission of the written thesis near graduation. You need to accept the suffering and have enough mental stamina and determination to deal with the pressure that comes with these crunch times. At some points you may lose track of what day of the week it is and temporarily adopt a questionable diet. You will be tired, exhausted, and alone in the lab on a beautiful, sunny weekend, looking at Facebook posts of your friends having fun in nature and on trips, paid for with their 5-10 times bigger salaries.

It could happen that you will have to throw away months of your work while somehow keeping your mental health intact. You will live with the fact that months of your time on this Earth were spent on a journal paper that draws limited interest (citations), while your friends participate in exciting startups, get hired at prestigious companies, or get transferred to exotic locations only to become managers there.

You will experience identity crises during which you will question your life’s decisions and wonder what you are doing with some of the best years of your life. You should therefore be certain that you can thrive in an unstructured environment in the chase for new scientific discoveries. If you are unsure, you should lean towards other pursuits. Ideally, you should dip your toe into research projects as an undergraduate or Masters student before you decide to commit. In fact, one of the primary reasons that research experience is so desirable during the Ph.D. admissions process is not the research itself, but the fact that the student is more likely to know what they are committing to.

Lastly, as a random thought: I have read and heard that you should only do a Ph.D. if you want to remain in academia. In light of all of the above I would argue that this is an urban myth and that the Ph.D. has strong intrinsic value of great appeal to many. It is an end in itself, not just a means to some end (e.g., a career in academia).



Photo by @homajob on Unsplash

So, you have made the choice. What next?

Getting into a Ph.D. program: (1) references, (2) references, (3) references.

Now how do you get into a good Ph.D. program? The very first step to conquer is quite simple: the by far most important component is strong reference letters.

  • The ideal scenario is that your professor writes you a glowing letter along the lines of: “X is in top five of students I have ever worked with. X takes initiative, comes up with their own ideas, and can finish a project.”

  • The worst letter is along the lines of: “X took my class. They are a good student/did well.” That may sound good at first sight, but by academic conventions is not good enough to impress anyone.

A research publication under your belt from a class project (or similar) is a very strong bonus and may often be requested in the application process. But it is not absolutely required, provided you have two or three strong reference letters. In particular note: grades are quite irrelevant and do not have to be stellar, but you generally do not want them to be unusually low. This is not obvious to many students, since many spend a lot of energy on getting good grades, which is a diminishing-returns problem. That time should instead be directed towards research and thesis work as much and as early as possible. Working under the supervision of multiple professors lets them observe your work ethic and skills so that they can confidently write reference letters. As a last point, what will not help you is showing up in professors’ offices out of the blue without scheduled (ideally recurring) meeting times. They are often very busy people, and if you do not adhere to appointment schedules, that may lead to doors closed in your face and, even worse, may backfire on you.

Picking the school

Once you get into some Ph.D. programs, how do you pick the school? Your dream university should:

  1. …be a top university. Not only because it looks good on your résumé and CV but because of less obvious feedback loops. Top universities attract other top students/professors, many of whom you will get to know and work with.

  2. …have a handful of potential professors you would want to work with. I really do mean a handful - this is very important and provides a safety net for you if things do not work out with your top choice for any one of dozens of reasons, in many cases outside of your control: e.g., your dream professor leaves, moves with all the lab equipment but without students, retires, dies (I have seen it a couple of times), has Title IX issues, or spontaneously disappears.

  3. …be in a good physical environment. I do not think new students appreciate this enough: you will spend ~5 of your really good years living in a city, so it had better be a place you enjoy. Really, this is a long period in your life. Your time in this city consists of much more than just your research and work.

To be continued…

Self-Supervised Learning - A recap by Facebook AI


Are we done with labeling yet?


  1. Preface
  2. Introduction
  3. Self-supervised learning is predictive learning
  4. Self-supervised learning for NLP and CV
  5. Modeling the uncertainty in prediction
  6. A unified view of self-supervised methods
    1. Joint embedding, Siamese networks
      1. Contrastive energy-based SSL
      2. Non-contrastive energy-based SSL (regularization-based)
  7. Advancing self-supervised learning for vision at Facebook
  8. Summary

Preface

This is my personal summary from reading Self-supervised learning: The dark matter of intelligence (4th March 2021), which was published on the Facebook AI blog https://ai.facebook.com/blog/. This department at Facebook is dedicated to applied research, but also assembles smart scientists who undertake relatively basic, but ground-breaking research. The article offers an overview of Facebook’s AI-frontier research on self-supervised learning (SSL) in the wake of other papers published by Facebook in the same month (DINO and PAWS: Advancing the state of the art in computer vision).

Introduction

Supervised models are specialized on relatively narrow tasks and require massive amounts of carefully labeled data. As we want to expand the capabilities of state-of-the-art models to classify things, “it’s impossible to label everything in the world. There are also some tasks for which there’s simply not enough labeled data […]”. The authors postulate that supervised learning may be at the limits of its capabilities and AI can develop beyond the paradigm of training specialist models. Self-supervised learning can expand the use of specialised training sets to bring AI closer to a generalized, human-level intelligence.

Going beyond a specialized training set and forming common sense from repeated trial-and-error cycles is in essence what humans and animals do. Humans do not need to be taught massive amounts of every possible variant of the same thing; rather, they build upon previously learned input. This common sense is what AI research is aspiring to when referring to self-supervised learning and what has remained elusive for a long time. The authors call this form of generalized AI the “Dark Matter” of AI: it is assumed to be everywhere and pervasive in nature, but we cannot see or interact with it (yet?).

SSL is not entirely new; it has found widespread adoption in training models for Natural Language Processing (NLP). Here, Facebook AI uses SSL for Computer Vision tasks and explains why SSL may be a good method for unlocking the “Dark Matter” of AI.

Self-supervised learning is predictive learning

Self-supervised learning is trained by signals from the data itself instead of labels. The data in the training set is not a complete set of observations (samples) and thus the missing parts are predicted using SSL. In an ordered dataset such as a sentence or a series of video frames, missing parts or the future/past can be predicted without the use of labels.
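As a minimal sketch of this idea (my own toy illustration, not code from the article), the training pairs can be carved directly out of an ordered sequence by hiding its future elements; no external labels are involved:

```python
# Toy "video": an ordered sequence of frames (made up for illustration).
sequence = ["frame1", "frame2", "frame3", "frame4", "frame5"]

# Build (context, target) pairs by hiding the next element of the sequence.
pairs = [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]
for context, target in pairs:
    print(f"predict {target!r} from {context}")
```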

Since the training signals originate from the dataset itself, the term self-supervised learning is more fitting than the older term “unsupervised learning”. The authors argue that self-supervised learning uses more supervision signals than both supervised and reinforcement learning.

Self-supervised learning for NLP and CV

In Natural Language Processing, a model is trained by showing it long texts, but no labels. These models are pretrained in a self-supervised phase and then fine-tuned for a use case, such as classifying the topic of a text (see sentiment analysis etc.). When given a sentence with blanked-out words, the model can predict the “correct” words. Attempts were made to apply the same concept to Computer Vision (CV), but so far it has never shown the same improvements as in NLP. A large challenge to overcome is the expression of uncertainty for predicted gaps in CV, which is relatively easy in NLP. That is because NLP always works with a finite vocabulary, and each word can be given a probability value. Designing such a model for CV requires incorporating accurate representations of errors, which used to be costly in computation and memory.

In the past, Facebook’s FAIR workgroup designed a new architecture of Convolutional Networks called RegNet (Regular Network) that solves this issue and efficiently stores billions/trillions of network parameters. It is still impossible to assign error to an infinite number of possible missing frames or patches within a frame, since we do not have a vocabulary as in NLP. However, efficient SSL techniques such as SwAV are improving vision-task accuracy using billions of samples.


Photo by @priscilladupreez on Unsplash

Modeling the uncertainty in prediction

During NLP’s prediction process the uncertainty is expressed with a softmax function across the entire vocabulary. That means the probabilities across the vocabulary sum up to 1.0, and the most likely word gets the highest probability assigned.

In CV, a frame or a patch within a frame is predicted out of infinite possibilities instead of NLP’s discrete vocabulary. We may never be able to assign prediction scores to the infinite frames of a high-dimensional continuous space, unless new techniques solve this problem.
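To make the contrast concrete, here is a minimal sketch (my own illustration, not code from the article) of how an NLP model can express uncertainty: scores for a blanked-out word are normalized with a softmax over a small, made-up vocabulary, which only works because the vocabulary is finite and discrete.

```python
import numpy as np

# Toy vocabulary and model scores (logits) for a blanked-out word; both made up.
vocab = ["cat", "dog", "car", "tree"]
logits = np.array([2.1, 1.3, -0.4, 0.2])

def softmax(x):
    # Subtract the max for numerical stability; the result sums to 1.0.
    e = np.exp(x - x.max())
    return e / e.sum()

probs = softmax(logits)
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
# The probabilities sum to 1.0 over the vocabulary; no such finite
# normalization exists over all possible image frames or patches.
```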

A unified view of self-supervised methods

Self-supervised models can be viewed within the framework of energy-based models (EBM). In this framework, x is the observed training input, which is compared to a candidate y. A single energy number tells how compatible x and y are, i.e., how likely it is that video clip/photo y follows clip/photo x. Low energy denotes good compatibility, high energy bad compatibility. The smaller the difference between x and y, the lower the energy value will be.
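As a minimal sketch of this idea (my own illustration, not Facebook’s code), the energy can simply be a distance between embeddings of x and y: small for compatible pairs, large for incompatible ones.

```python
import numpy as np

def energy(x_emb: np.ndarray, y_emb: np.ndarray) -> float:
    # Squared Euclidean distance as a stand-in energy: low = compatible.
    return float(np.sum((x_emb - y_emb) ** 2))

x = np.array([0.2, 0.9, -0.1])                 # embedding of the observed x
y_compatible = np.array([0.25, 0.85, -0.05])   # plausible continuation
y_incompatible = np.array([-1.0, 0.1, 2.0])    # implausible continuation

print(energy(x, y_compatible))    # small value -> good compatibility
print(energy(x, y_incompatible))  # large value -> bad compatibility
```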

Joint embedding, Siamese networks

Siamese networks use two identically architected deep neural networks with shared parameters that encode two images. The networks compute the two image embeddings, which are then compared with each other. The networks are trained to produce low energies for compatible pairs, which sometimes makes it difficult to produce the high energies needed for different images. Several techniques exist to prevent the network from producing very similar embeddings for every sample, a failure mode called “collapse”.
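A minimal sketch of such a joint-embedding setup (illustration only; layer sizes and inputs are made up) could look like this: one encoder with shared parameters processes two views, and the energy is derived from the similarity of the two embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One encoder used for both branches -> shared parameters (Siamese setup).
encoder = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 256),
    nn.ReLU(),
    nn.Linear(256, 64),
)

view_a = torch.randn(8, 3, 32, 32)  # e.g., two augmented views of the same images
view_b = torch.randn(8, 3, 32, 32)

z_a = F.normalize(encoder(view_a), dim=1)
z_b = F.normalize(encoder(view_b), dim=1)

# Energy as 1 - cosine similarity: low for similar embeddings, high otherwise.
energy = 1 - (z_a * z_b).sum(dim=1)
print(energy)
```

Without additional tricks, such a network can collapse to nearly identical embeddings for every input, which is exactly the failure mode the contrastive and non-contrastive techniques below try to avoid.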


Photo by @mrcageman on Unsplash

Contrastive energy-based SSL

Latent-variable models can be trained with contrastive methods. A good example of a contrastive setup is the Generative Adversarial Network (GAN) architecture. In SSL, contrastive training selects well-chosen pairs of x and y that are incompatible, in order to push their energies up.
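A minimal sketch of a contrastive objective (a generic InfoNCE-style loss of my own, not necessarily the exact method used in the article): embeddings of matching pairs are treated as positives, and all other pairings within the batch act as the incompatible, high-energy negatives.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1):
    # Normalize, then compute pairwise similarities between the two batches.
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature
    # The i-th element of z_a matches the i-th element of z_b (diagonal positives).
    targets = torch.arange(z_a.size(0))
    return F.cross_entropy(logits, targets)

z_a = torch.randn(8, 64)  # toy embeddings of view A
z_b = torch.randn(8, 64)  # toy embeddings of view B
print(contrastive_loss(z_a, z_b))
```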

As described above, NLP problems are easier to tackle thanks to the finite vocabulary.

A latent-variable predictive architecture, given an observation x, must be able to produce a set of multiple compatible predictions. Latent-variable predictive models contain an extra input variable z. This variable z is called latent because its value is never observed. As the latent variable z varies within a set, the output varies over the set of predictions.
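A minimal sketch of the latent-variable idea (a made-up toy model): keeping the observation x fixed and varying the unobserved latent z produces a whole set of different predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 3))   # toy weights mapping the observation
W_z = rng.normal(size=(4, 2))   # toy weights mapping the latent variable

def predict(x: np.ndarray, z: np.ndarray) -> np.ndarray:
    # One observation plus one latent value -> one concrete prediction.
    return W_x @ x + W_z @ z

x = np.array([1.0, -0.5, 0.2])
for _ in range(3):
    z = rng.normal(size=2)      # varying z sweeps over the set of predictions
    print(predict(x, z))
```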

However, it is very difficult to find maximally incompatible pairs of images (high-dimensional data). So, how could it be possible to increase energy for incompatible candidates?

Non-contrastive energy-based SSL (regularization-based)

Non-contrastive energy-based methods are a very actively researched field in SSL and are divided into two groups (a minimal sketch of the asymmetric-branches idea follows the list below):

  1. computing virtual target embeddings for groups of similar images
    • DeeperCluster
    • SwAV
    • SimSiam
  2. making the two encoders slightly different (architecture or parameter vector)
    • BYOL
    • MoCo
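As a minimal sketch of the asymmetric-branches idea (a BYOL/SimSiam-flavored illustration of my own, not the exact published methods): the two branches are made slightly different by giving one of them an extra predictor head and detaching the other branch’s output (stop-gradient), which in practice helps avoid collapse without any negative pairs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(128, 64)    # stand-in for a deep encoder (made up)
predictor = nn.Linear(64, 64)   # extra head on one branch only -> asymmetry

view_a = torch.randn(8, 128)    # two augmented views of the same samples
view_b = torch.randn(8, 128)

p_a = F.normalize(predictor(encoder(view_a)), dim=1)
z_b = F.normalize(encoder(view_b), dim=1).detach()  # stop-gradient branch

# Negative cosine similarity: pull the prediction toward the detached target.
loss = -(p_a * z_b).sum(dim=1).mean()
print(loss)
```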

However, the authors hypothesize that it may be beneficial in the future to combine non-contrastive models with latent variables.

Future research will focus on top performing models without requiring large amounts of labeled data.


Photo by @vork on Unsplash

Advancing self-supervised learning for vision at Facebook

A new network was trained at Facebook AI called SEER that may bring about a paradigm shift in CV:

  • The network contains 1 billion parameters, with the SwAV method applied to a Convolutional Network.
  • It was trained on random Instagram images without any labels.
  • It outperforms any other self-supervised system.
  • It reached 84.2 percent top-1 accuracy on ImageNet data.

Summary

Supervised learning is at the forefront of everyone’s attention today. Self-supervised methods have been restricted to certain niches such as NLP, where predictions are discrete and finite. Bringing self-supervised methods to CV requires solving the problem of assigning probabilities to infinite, continuous, high-dimensional predictions. Siamese networks trained with non-contrastive methods are the current state of the art for avoiding the endless labeling effort of supervised training. Facebook released a new, open-source network called SEER that outperforms other self-supervised networks after being pre-trained on random, unlabeled images.

SQL Table Management


I accidentally dropped a table. Not.


  1. Making and Deleting Tables
    1. Create Table
    2. Drop Table
  2. Information Schema
    1. List all Tables
    2. List all Columns in a Table
    3. List Table Constraints
  3. Insert Into
    1. Insert a Row
    2. Insert a Column
  4. Update Table
  5. Delete Entries
  6. Alter Table
    1. Rename Column
    2. Add Column
    3. Drop Column
    4. Column Type
    5. Set Not Null, Drop Not Null
    6. Unique Constraint
    7. Primary Key Constraint
    8. Primary Key / Add Surrogate Key Column
    9. Foreign Key Constraint (1 to n)
    10. Foreign Key Constraint (n to m)
    11. Key Referential Integrity
  7. Triggers
    1. Data Manipulation: AFTER triggers
    2. Data Manipulation: INSTEAD OF triggers
    3. Data Definition: FOR triggers
    4. Listing all triggers and trigger events in the system
    5. Viewing a specific trigger
    6. Listing all server, DB, and table triggers with definitions
    7. Disable / Enable / Drop / Alter Triggers

Database tables are split up into sub-tables to avoid saving redundant records. This requires maintaining a more complex system, but it may be more efficient after all. While querying, we can join these tables back together. This post contains a number of SQL statements used for managing values, columns, tables, and databases.

Of course, the original tables and outputs are not available here, but the point is providing a lookup for structures and building blocks.

Making and Deleting Tables

Create Table

New tables can be created directly in SQL syntax.

CREATE TABLE table_name (
	column1 dtype,
	column2 dtype,
	column3 dtype
);

For example:

CREATE TABLE rocktypes (
	ID integer PRIMARY KEY,
	name text NOT NULL,
	type text UNIQUE,
	cleavage_angle integer
);

Most common Data Types:

| Data Type | Description |
| --- | --- |
| text | string of any length |
| varchar [ (x) ] | maximum length of x characters |
| char [ (x) ] | string of fixed length x |
| boolean | TRUE, FALSE, NULL |
| date, time, timestamp | date/time formats |
| numeric | no precision defined |
| integer | -2147483648 to 2147483647 |

PostgreSQL Documentation: DataType table

Drop Table

Delete the entire table.

DROP TABLE table_name;

Information Schema

List all Tables

SELECT table_name 
FROM information_schema.tables
WHERE table_schema = 'public';

List all Columns in a Table

SELECT column_name, data_type 
FROM information_schema.columns
WHERE table_name = 'employees' AND table_schema = 'public';

List Table Constraints

List all foreign Keys

SELECT constraint_name, table_name, constraint_type
FROM information_schema.table_constraints
WHERE constraint_type = 'FOREIGN KEY';

Insert Into

Insert a Row

INSERT INTO table_name (existing_column1, existing_column2)
VALUES (val1, val2);

For Example:

INSERT INTO employee (firstname, lastname, department_shortname)
VALUES (NULL, 'Miller', 'SLS');

Insert a Column

INSERT INTO table_name
SELECT column1
FROM old_table;

For Example:

Use DISTINCT to remove duplicate records.

INSERT INTO rocktypes
SELECT DISTINCT outcrop, handpiece
FROM samples;

/* double check */
SELECT *
FROM rocktypes;

Update Table

Similar to INSERT INTO, but with additional conditions.

UPDATE connections
SET employee_id = employees.id
FROM employees
WHERE connections.firstname = employees.firstname AND connections.lastname = employees.lastname;

Delete Entries

Delete rows selectively where conditions are met.

DELETE FROM organizations
WHERE id = 'ABC';

Alter Table

Rename Column

Have you made a typo when creating a table?

ALTER TABLE table_name
RENAME COLUMN old_colname TO new_colname;

Add Column

ALTER TABLE table_name
ADD COLUMN id varchar(128);

Drop Column

Delete a column in a table.

ALTER TABLE table_name
DROP COLUMN column_name;

Column Type

After creation, the data type of a column can still be modified. Extending a string length is easy.

ALTER TABLE table_name
ALTER COLUMN column_name
TYPE varchar(256);

Shortening the length of a value requires a processing step first. For example, a float type gets rounded first and then cast to integer type.

ALTER TABLE table_name
ALTER COLUMN column_name
TYPE integer
USING ROUND(column_name);

For example, when shortening string entries, a SUBSTRING has to be created first, selecting the characters from 1 to x.

ALTER TABLE table_name
ALTER COLUMN column_name
TYPE varchar(16)
USING SUBSTRING(column_name FROM 1 FOR 16);

Set Not Null, Drop Not Null

By default, a column can contain NULL values. A column’s ‘NOT NULL’ flag can be set after the table was created.

ALTER TABLE table_name
ALTER COLUMN column_name
SET NOT NULL;

Remove ‘NOT NULL’.

ALTER TABLE table_name
ALTER COLUMN column_name
DROP NOT NULL;

Unique Constraint

Note: The unique constraint requires a name (here constraint_name).

ALTER TABLE table_name
ADD CONSTRAINT constraint_name UNIQUE(column_name);

Primary Key Constraint

ALTER TABLE table_name
ADD CONSTRAINT constraint_name PRIMARY KEY(column_name);

Primary Key / Add Surrogate Key Column

Add a primary key serial type.

ALTER TABLE table_name
ADD COLUMN id serial PRIMARY KEY;

Foreign Key Constraint (1 to n)

Generate a foreign key column that references a primary key in another table.

Assuming a star-pattern schema, the central table gets a foreign key constraint added on central_table_id that references branch_id.

ALTER TABLE [central_table] 
ADD CONSTRAINT [my_name_for_FKey_constraint] FOREIGN KEY (central_table_id) REFERENCES [branch_table] (branch_id);

For example:

Inserting entries into the employees table that violate the foreign key constraint, e.g., a non-existing company, will result in an error.

ALTER TABLE employees 
ADD CONSTRAINT employees_fkey FOREIGN KEY (company_id) REFERENCES companies (id);

Drop / Delete a Foreign Key Constraint

ALTER TABLE connections
DROP CONSTRAINT connections_organization_id_fkey;

Foreign Key Constraint (n to m)

Two foreign key columns referencing other tables implement a many-to-many relationship.

CREATE TABLE connections (
    employee_id integer REFERENCES employees (id),
    organization_id varchar(256) REFERENCES organizations (id),
    function varchar(256)
	);

Key Referential Integrity

Foreign Keys referencing other tables can be used to prevent entering or deleting data that is not given/referred to in the other table.

We can determine what happens to the referenced table if an entry of our foreign key gets deleted.

Example: NO ACTION

If you delete an entry in table b that is still referenced in table a, the system will raise an error.

CREATE TABLE a (
    id integer PRIMARY KEY,
    column_1 varchar(64),
    b_id integer REFERENCES b (id) ON DELETE NO ACTION
	);

Example: CASCADE

If you delete an entry in table b, all referencing entries in table a will be deleted, too.

CREATE TABLE a (
    id integer PRIMARY KEY,
    column_1 varchar(64),
    b_id integer REFERENCES b (id) ON DELETE CASCADE
    );
ALTER TABLE connections
ADD CONSTRAINT connections_organization_id_fkey FOREIGN KEY (organization_id) REFERENCES organizations (id) ON DELETE CASCADE;

| ON DELETE … | Description |
| --- | --- |
| NO ACTION | error |
| CASCADE | delete all referencing entries |
| RESTRICT | error |
| SET NULL | set referencing entry to NULL |
| SET DEFAULT | set referencing entry to default value |

Triggers

Data Manipulation: AFTER triggers

Use after INSERT, UPDATE, DELETE data manipulation.

The tables ‘deleted’ and ‘inserted’ are automatically created by SQL Server.

Example 1: AFTER DELETE

CREATE TRIGGER TrackDeletedProducts
ON ProductsTable
AFTER DELETE
AS
    INSERT INTO DeletedProducts (Product, Price)
    SELECT Product, Price
    FROM deleted;

Example 2: AFTER INSERT, UPDATE

The ‘inserted’ table is used for both ‘INSERT’ and ‘UPDATE’.

CREATE TRIGGER TrackClientHistory
ON ClientTable
AFTER INSERT, UPDATE
AS
    INSERT INTO ClientHistoryTable (Client, ClientID, Address, PhoneNumber, EventDate) -- EventDate column assumed in the history table
    SELECT Client, ClientID, Address, PhoneNumber, GETDATE()
    FROM inserted;

Example 3: After INSERT, execute e-mail sending script.

CREATE TRIGGER NewClientPushEmail
ON ClientTable
AFTER INSERT
AS
    EXECUTE SendNotification @RecipientEmail = '[email protected]'
                            ,@EmailSubject = 'New Client Signup'
                            ,@EmailBody = 'A new client just signed up with us.';

Example 4: Differentiate INSERT, UPDATE, and DELETE with CASE

Write changes into a ClientAuditTable

CREATE TRIGGER ClientAudit
ON Orders
AFTER INSERT, UPDATE, DELETE
AS
    DECLARE @Insert BIT = 0;
    DECLARE @Delete BIT = 0;
    IF EXISTS (SELECT * FROM inserted) SET @Insert = 1;
    IF EXISTS (SELECT * FROM deleted) SET @Delete = 1;
    INSERT INTO ClientAuditTable (TableName, EventType, UserAccount, EventDate)
    SELECT 'Orders' AS TableName
           ,CASE WHEN @Insert = 1 AND @Delete = 0 THEN 'INSERT'
                 WHEN @Insert = 1 AND @Delete = 1 THEN 'UPDATE'
                 WHEN @Insert = 0 AND @Delete = 1 THEN 'DELETE'
                 END AS EventType
           ,ORIGINAL_LOGIN() AS UserAccount
           ,GETDATE() AS EventDate;

Data Manipulation: INSTEAD OF triggers

Use instead of INSERT, UPDATE, DELETE data manipulation.

Example 1: Instead of updating, raise an error message and send a push email.

CREATE TRIGGER PreventProductsUpdate
ON Products
INSTEAD OF UPDATE
AS
    DECLARE @EmailBodyText NVARCHAR(200) = 
                        (SELECT 'User ' + ORIGINAL_LOGIN() + ' tried to update the Products table.');
						
    RAISERROR ('Updates on "Products" table are not permitted.', 16, 1);
	
    EXECUTE SendNotification @RecipientEmail = '[email protected]'
                            ,@EmailSubject = 'User tried to update Products'
                            ,@EmailBody = @EmailBodyText; 

Example 2: Instead of inserting, check first if condition is met, then resume insert

CREATE TRIGGER ConfirmInventory
ON OrderTable
INSTEAD OF INSERT
AS
    IF EXISTS (SELECT * FROM inventory AS iv
               INNER JOIN inserted AS ins ON ins.Product = iv.Product
               WHERE iv.Quantity < ins.Quantity)
        RAISERROR ('You cannot order higher quantities than there are in the inventory.', 16, 1)
    ELSE
        INSERT INTO dbo.OrderTable (Client, Product, Quantity, Date, Total)
        SELECT Client, Product, Quantity, Date, Total FROM inserted;

Data Definition: FOR triggers

FOR triggers can only be run after, not instead of. That means ‘FOR’ is a synonym for ‘AFTER’.

It is impossible to prevent server or database events with ‘INSTEAD OF’; however, they can be rolled back with ‘FOR’.

The ‘DATABASE’ level includes all tables in the database.

CREATE TRIGGER TrackTableChanges
ON DATABASE
FOR CREATE_TABLE,
    ALTER_TABLE,
    DROP_TABLE
AS
    INSERT INTO TablesLog (EventData, [User])
    VALUES (EVENTDATA(), USER);

Roll back a database-level event with ‘FOR’.

CREATE TRIGGER PreventTableDrop
ON DATABASE
FOR DROP_TABLE
AS
    RAISERROR ('DROP is not allowed.', 16, 1);
    ROLLBACK;

Roll back a server-level event with ‘FOR’.

CREATE TRIGGER PreventDatabaseDrop
ON ALL SERVER
FOR DROP_DATABASE
AS
   PRINT 'You are not permitted to drop databases.';
   ROLLBACK;

Listing all triggers and trigger events in the system

Server-level triggers:

SELECT * FROM sys.server_triggers;

SELECT * FROM sys.server_trigger_events;

Database + Table-level triggers (see column ‘parent_class_desc’):

SELECT * FROM sys.triggers;

SELECT * FROM sys.trigger_events;

Example: Show all disabled triggers

SELECT name,
	   object_id,
	   parent_class_desc
FROM sys.triggers
WHERE is_disabled = 1;

Viewing a specific trigger

The name of a specific trigger has to be known; it is then converted into an object_id with OBJECT_ID().

Method 1:

SELECT definition
FROM sys.sql_modules
WHERE object_id = OBJECT_ID ('PreventTableDrop');

Method 2:

SELECT OBJECT_DEFINITION (OBJECT_ID ('PreventTableDrop'));

Listing all server, DB, and table triggers with definitions

SELECT name AS TriggerName,
       parent_class_desc AS TriggerType,
       create_date AS CreateDate,
       modify_date AS LastModifiedDate,
       is_disabled AS Disabled,
       is_instead_of_trigger AS InsteadOfTrigger,
       OBJECT_DEFINITION  (object_id)
FROM sys.triggers
UNION ALL
SELECT name AS TriggerName,
       parent_class_desc AS TriggerType,
       create_date AS CreateDate,
       modify_date AS LastModifiedDate,
       is_disabled AS Disabled,
       0 AS InsteadOfTrigger,
       OBJECT_DEFINITION  (object_id)
FROM sys.server_triggers
ORDER BY TriggerName;

Disable / Enable / Drop / Alter Triggers

On table level:

DISABLE TRIGGER PreventProductsUpdate;

ENABLE TRIGGER PreventProductsUpdate;

DROP TRIGGER PreventProductsUpdate;

On database level:

DISABLE TRIGGER PreventTableDrop
ON DATABASE;

ENABLE TRIGGER PreventTableDrop
ON DATABASE;

DROP TRIGGER PreventTableDrop
ON DATABASE;

On Server level:

DISABLE TRIGGER PreventDatabaseDrop
ON ALL SERVER;

ENABLE TRIGGER PreventDatabaseDrop
ON ALL SERVER;

DROP TRIGGER PreventDatabaseDrop
ON ALL SERVER;

‘ALTER TRIGGER’

Altering triggers works with the same syntax as ‘CREATE TRIGGER’, but combines the dropping and creating of a trigger in one step.

ALTER TRIGGER PreventDatabaseDrop
ON ALL SERVER
FOR DROP_DATABASE
AS
   PRINT 'You are ABSOLUTELY NOT permitted to drop databases.';
   ROLLBACK;

Data Science Hiring Theory


The ultimate process does not exist. Only compromises.


  1. Interview Style
  2. Productivity and Knowledge
    1. Summary
  3. Graduate and Professional Productivity History
    1. Professional
    2. Graduates
    3. Summary
  4. Current Skill Set vs. Learning
    1. Lateral Transfers
    2. Summary

Interview Style

To start a career as a Data Scientist (or to move laterally, pivoting into the field), you need to go through interviews, just as for any other role. At the same time, Data Science has its very own rules, pitfalls, and preconceptions that are hard to navigate. Sometimes the process is more similar to Software Engineering, sometimes closer to Business.

Candidates/applicants want to know what is special about a Data Science interview. Hiring managers want to know what the right or wrong way to interview is.

Having had conversations with friends and family in the technology sector, and on linkedin.com and anonymous platforms such as reddit.com and teamblind.com, I came to the conclusion that there is no universally right way to interview. This is super important for both hiring managers and candidates to understand.

Why? It is all about making compromises. There are no absolutes:

  • Shorter, more succinct interview processes introduce a higher risk of hiring a suboptimal candidate. Longer, more in-depth interview processes introduce a higher risk of false negatives (i.e., rejecting a candidate who would actually be great for the role) and of losing qualified applicants along the way.

  • A strong focus on quizzes / coding tests / quick-reaction questions / take-home assignments introduces a high risk of giving good scores to applicants who either prepared extremely well for interviews or simply got lucky and knew the answers to those specific questions. Focusing instead only on their work experience introduces a high risk of hiring someone who cannot improvise.

Productivity and Knowledge

Having said that, in my experience there are a couple of universal facts in a statistical sense that really help with interviewing:

A history of continuously being productive is with high likelihood an indicator for being productive in the future. Applicants with a history / résumé that lack productivity are unlikely to become productive in a new role.

This is, of course, a relatively sweeping judgement, I know. Let me explain. Many candidates look great on paper: they had the right classes, know the right things, and talk convincingly about their abilities. But for some reason, when they get hired, they struggle to get things done. Everything takes too long, they overthink things, they don’t know when to ask for help, etc.

And that is when the red flags show up (retroactively): the candidates knew all the facts but had never produced output, whether in grad school (publications, projects, …) or at work. The ratio of productivity to knowledge is out of balance.

According to some people, this is mostly an inherent trait. Some candidates just find ways to get things done, and other candidates find ways to NOT get things done. Many people believe that there is very little you can do as a manager to change that in a person.

This leads to the realization that a candidate who aces every part of the interview but has nothing to show for their previous roles becomes a concern, unless they have an explanation as to why. Most of the time there is no satisfactory reason.

Summary

Hiring managers:

Focus on evidence of productivity in candidates.

Candidates:

Make your résumé scream “I will get stuff done”. Achievements outweigh knowledge.



Photo by @magnetme on Unsplash

Graduate and Professional Productivity History

Everyone is going to be compared against a benchmark, but the benchmarks are different. Someone fresh out of a Master’s degree is going to get graded on a different scale than someone with two years of experience. Either way, the candidate needs to look productive relative to their peers.

Candidates do not need to provide excessive detail to show they were productive. Ideally, a candidate can write/say something along the lines of “Delivered $2M in revenue by implementing a [some ML] model in [language]”. If a professional candidate is not allowed to disclose those details, they can remove the quantities and keep it vaguer. What’s critical is that they provide a cumulative list of projects / impact that looks compelling. If the candidate has worked for an employer for two years, they need to either have one to two really impressive accomplishments, or four to six more moderate/minor ones. They need a combination of volume and impact that stands out.

Professional

Something candidates often miss: it is not good practice to combine or aggregate multiple projects (e.g., “Delivered $[x]M in value across several projects”). Instead, listing every single project is likely to be more successful. Do not say/write things such as “Collaborated on several cross-department initiatives”. List every single one and what was accomplished in them – even if minor.

For example, compare these two lists:

  • Delivered $[x]M in revenue as part of a multi-year, strategic plan to overhaul sales analytics process.

vs.

  • Generated $[xxx]K in costs savings by optimizing ad spend across channels using a linear programming approach.

  • Identified $[xxx]K in additional revenue opportunities through targeted price increases.

  • Mitigated COGS increases of [x]% by consolidating volume across multiple brands.

  • Helped leadership identify $[xx]K in redundancies by providing ad-hoc analyses.

What would you prefer as a hiring manager? Conversations with DS hiring managers suggest that they will almost surely prefer the second list, because it tells them the candidate has generated value in many different ways and at many different times. That means more evidence that proves “I can accomplish things”.

Graduates

As a fresh graduate, it is critical to make sure you cover all papers, reports, documents, projects, etc. Descriptions to avoid are:

  • Conducted cross-functional research in the area of [something]

And instead break that down into details:

  • Performed literature review (>20 manuscripts) in [something] design, [type of] algorithms and [something] optimization.

  • Conducted weekly research meetings with researchers in the Department of [something].

  • Prototyped novel model using combination of [language] and [framework].

  • Delivered monthly status updates to advisor.

Summary

Focus on detailed accomplishments and results. If possible, provide numbers.



Photo by @casparrubin on Unsplash

Current Skill Set vs. Learning

While it is a platitude, it is still a universal truth:

“The only constant in workplaces is change.”

For many hiring managers, very high on the priority list is finding out whether someone can quickly learn exactly the things they will need to do now and in the future.

Other hiring managers ask themselves “What would this person do today?”, and then focus on finding people who are doing exactly those things today.

If the applicant pool is large (e.g., the manager needs them to code in [language]), then it is a reasonable approach to be very specific. However, if the candidate pool is small (e.g., they need people with experience in one specific algorithm), then things become difficult. This is because the odds of finding someone who is both an excellent interviewee and has that exact experience are minuscule. This is where the search principles come back to the productivity-history screening. Hiring managers should put more emphasis on finding productive people than on finding people with the exact experience they want.

Why?

Experience can be acquired. People can learn. Not only that: what they have to do today may not be what they will be doing half a year in the future. Perhaps that new employee comes in and is able to suggest a new approach that works better than what was done before, rendering that skill set obsolete.

A candidate’s current skill set is largely dictated by their current job and should not be taken as fixed and static.

Lateral Transfers

When hiring managers consider candidates who have been in a single environment for several years, they look at the résumé and assume that what the candidate is doing now is just who they are. “Well, this applicant has done a lot of reporting as a Data Analyst - not what we are looking for. We need a Data Scientist building [some type of] models.”

And that is the wrong approach. All this indicates is that the candidate is doing reports, which is what they need to do in order to do their current job well.

So, what can a hiring manager do with a candidate that has done mostly Data Analysis-work but wants a position as Data Scientist training models?

Answer:

  • Measure productivity. If the applicant’s previous job was data analytics reporting, then how successful were they at that? Did they differentiate themselves in that field? Did they go above and beyond what others would have done in the role?

  • What experience does the applicant have with modeling - not in this previous role, any time before? How well can they describe that experience? How willing are they to return to it?

  • What experience does the applicant have picking up new skills? In their current role, did they have to learn new technologies or languages?

Summary

Hiring managers can take the risk-averse position and not even entertain the thought of hiring someone without the exact résumé and experience they are looking for. But they are overwhelmingly likely to miss out on some great candidates.

BASHing the Linux console


Unix shell tricks for the Command Line Interface


  1. Important Commands / Apps
  2. Globbing
  3. Compression
    1. bzip2, gzip, xz
    2. tar archives
    3. Compressed tar archives (double ending .tar.gz or .tgz)
    4. Managing zip files
  4. Cron jobs
    1. Examples
  5. Redirecting input/output/error IO
    1. Keyboard input (stdin, channel 0)
    2. Screen print (stdout, channel 1)
    3. Error (stderr, channel 2)
    4. Substituting keyboard input with file contents
    5. Combo redirects (channel 1 screen & channel 2 error)
  6. Command chains
    1. Sequential runs in a command line with ;
    2. Piping | redirect output to next command as input
    3. Redirect output only if successful
  7. Grep - searching within files
    1. Regex for grep and more
    2. Examples
  8. Writing / executing bash scripts
    1. Examples
  9. If-conditions
    1. Operators
    2. Examples
  10. Exit codes
  11. Case-conditions
  12. For-loops
    1. Examples
  13. Functions
    1. Returning Data and Arguments
  14. csvkit data processing
    1. in2csv converting files to csvkit
    2. csvlook csv preview in shell
    3. csvstat descriptive statistics
    4. csvcut column cutting
    5. csvgrep search in csv
    6. csvstack vertical stacking
    7. sql2csv load from sql database
      1. Establishing database connection
      2. Querying against the database
      3. Saving the output
    8. csvsql
      1. Write SQL queries against csv
      2. Query joined tables/csv files
      3. Upload csv to database

The following writeup contains general bash knowledge I have written down over the years and finally decided to translate into markdown.

More flags and examples can be found on the internet.

Important Commands / Apps

| Command | Long name | Description | Flag | Example |
| --- | --- | --- | --- | --- |
| ls, ll | list short, list long | display the contents of the current directory (reported dir sizes are not the true content sizes) | | ls [dir] |
| | | long listing | -l | ls -l [dir] |
| | | human-readable file sizes | -h | ls -h [dir] |
| | | sort by modification time | -t | ls -t [dir] |
| | | all hidden paths/dirs | -a | ls -a [dir] |
| | | all files and dirs, no subdir contents | -d | ls -d [dir] |
| | | all files and dirs with sub-directories (recursive), similar to “find” or “tree” | -R | ls -R [dir] |
| | | sort by extension | -X | ls -lX [dir] |
| | | sort by size | -S | ls -S [dir] |
| | | reverse sorting | -r | ls -r [dir] |
| tree | | prints a graphic representation of files and sub-directories (non-std app) | | tree [directory] |
| | | root needs sudo privileges | -F | sudo tree -F / |
| | | tree depth, level of recursion | -L | sudo tree -L 2 / |
| du | disk usage | prints dir contents like “ls -lh [dir]”, but with real dir sizes | | |
| pwd | print working directory | shows the currently set working directory | | |
| cd | change directory | changes to a different directory | | cd [dir] |
| ../.. | | two directory levels up | | |
| ~ | | logged-in user’s home directory | | |
| . | | current directory | | |
| .. | | parent directory | | |
| cat | concatenate | view contents of a single file | | cat [File] |
| | | view contents of multiple files | | cat [File1] [File2] |
| | | redirecting into a file that doesn’t exist yet creates it | | cat > [File] |
| cut | | removes sections from a text file | | |
| cp | copy | copies one or multiple files; warning: if the destination file exists it is overwritten | | cp [Src-file] [Dest-file] |
| | | directory copy (recursive) | -r | cp -r [Src-dir] [Dest-dir] |
| | | copy files to dir | | cp [Src-file1] [Src-file2] [Dest-dir] |
| mv | move | move or rename | | mv [filename] [dest-dir] |
| | | prompt before overwrite | -i | mv -i [file] [new_file] |
| mkdir | make directory | create a new directory | | mkdir [newdir] |
| | | new dir within new dir | -p | mkdir -p [newdir1]/[newdir2] |
| touch | | creates an empty file or updates the modification date | | touch [file1] [file2] [file3] |
| | | do not create the file if it does not exist | -c | touch -c [file] |
| echo | | print to terminal | | echo [string] |
| | | write to file (overwrites previous content) | | echo hello > file.txt |
| wc | word count | count number of words | -w | wc -w [file] |
| | | count number of lines | -l | wc -l [file] |
| passwd | password | changes a user’s password | | |
| rm | remove | delete a file OR directory | | |
| | | remove dir with sub-dirs | -r | rm -r [myfolder] |
| | | remove with y/n confirmation prompt | -i | rm -i [myfile] |
| rmdir | remove directory | deletes a directory, by default only empty dirs | | rmdir [directory] |
| more | | view text files one screen at a time (superseded by “less”) | | |
| less | | views text files, allows scrolling up and down a line or page at a time | | |
| whereis | | display the file path to a specified program and related manual files | | |
| head | | show the first n lines of a text file | -n | head -n 5 [file] |
| tail | | show the last n lines of a file | -n | tail -n 5 [file] |
| sort | | order line by line (default alphabetical) | | sort [file] |
| | | numerical order | -n | sort -n [file] |
| | | reverse order | -r | sort -r [file] |
| tr | translate | replace or remove characters | | tr "[:lower:]" "[:upper:]" |
| chmod | | changes a file’s permissions (read, write, execute) | | |
| grep | g/re/p | search within a file (global regular expression print) | | |
| | | case-insensitive search | -i | |
| | | recursive in subfolder files | -r | |
| | | count matches | -c | |
| | | invert, show what does NOT match | -v | grep -v "[sawgtfixk]" [file] |
| | | extended regex (+ ?) | -E | grep -E "e+$" [file] |
| find | | search for a file in the home dir | | find ~ -name [file] |
| | | search in the current dir | | find . -name [file] |
| | | find all files ending with a number | | find ~ -name "*[0-9]" |
| locate | | search for a file in the index database (update the db with sudo updatedb) | | |
| man | | open the manual page of a command | | man [command] |
| info | | open the info page of a command | | info [command] |
| sed | | works on lines (modifies words or other parts of lines, inserts, deletes) | | |
| awk | | works on records with fields | | |
| which | | show the path of an executable specified in the PATH variable | | which netcat |
| expr | expression | integer math | | expr 4 + 1 |
| | | bash arithmetic expansion | | echo $((4+1)) |
| bc | basic calculator | decimal-place math, by default rounded to 0 decimal places | | echo "4.1 + 2.3" \| bc |
| | | define decimal places with scale | | echo "scale=3; 10/3" \| bc |

Globbing

Globbing is using wildcards to select multiple files and is similar to regex (see below).

  • * any number of any character, including no characters. Example: ls question20*3
  • ? any ONE character. Example: ls question?3
  • [] class/range of characters. Example: ls file[1-3]
  • ^ excluding. Example: ls file[^a]
  • [:digit:] digits. Example: ls file[[:digit:]a]
  • [:alpha:] alphabetical
  • [:alnum:] alphabetical + numerical (digits)
  • [:blank:] space, tabs
  • [:space:] whitespace (space, tabs, newline)
  • [:cntrl:] control characters (backspace, bell, NAK, escape)
  • [:lower:] lowercase
  • [:upper:] uppercase
  • [:punct:] punctuation (period, comma, semicolon, exclamation mark, ampersand, …)
  • [:print:] all printable symbols = alphabetical + numerical + punctuation + space
  • [:xdigit:] hexadecimal digits

Compression

bzip2, gzip, xz

bzip2 [file] --> bunzip2 [file.bz2]

gzip [file] --> gunzip [file.gz]

xz [file] --> unxz [file.xz]

Gzipping deletes the original, gunzipping deletes the zipped file

Compression levels

gzip -1 [file] or gzip -9 [file]

1 = weak, 9 = strong

Access files while they are still compressed

Bzip2: bzcat, bzgrep, bzdiff, bzless, bzmore

Gzip: zcat, zgrep, zdiff, zless, zmore

Xz: xzcat, xzgrep, xzdiff, xzless, xzmore

tar archives

Creating a tarball archive

tar -cf [new-tar-dir.tar] [file/dir-worked-on 1] [file/dir-worked-on 2]

c = create, f = filename

View contents of tarball

tar -tf [file.tar]

t = view

Add/update a file in an existing tarball archive (only works uncompressed)

tar -uf [file.tar] [file/dir-worked-on 1] [file/dir-worked-on 2]

u = update

Extract tarball archive

tar -xf [file.tar]

x = extract

Extract single file of tarball archive

tar -xf [file.tar] [compression/hosts.gz]

x = extract with f = filename inside the tarball

Compressed tar archives (double ending .tar.gz or .tgz)

tar -czf gzip.tar.gz [file/dir-worked-on 1] [file/dir-worked-on 2]

tar -cjf bzip2.tar.bz2 [file/dir-worked-on 1] [file/dir-worked-on 2]

tar -cJf xz.tar.xz [file/dir-worked-on 1] [file/dir-worked-on 2]

c = create, z = gzip, j = bzip2, J = xz, f = filename

Unzipping .tgz

gunzip [archive.tgz]

tar xf [archive.tar]

Managing zip files

zip

zip -9 [zipfile.zip] [file-worked-on] # 0-9 compression level

zip -r [zipfile.zip] [file/dir-worked-on 1] [file/dir-worked-on 2]

r = recursively all subfolders

Unlike the other file formats, zipping does NOT automatically delete the uncompressed files, so remove them manually if needed:

rm -rf [file/dir-worked-on 1]

unzip

unzip [zipfile.zip]

Cron jobs

crontab -l List all cron jobs

crontab -e Edit list of cron jobs, select editor

Alternative: Open crontab file with nano crontab or vim crontab.

crontab -r Delete the current cron jobs

nano /etc/crontab

* * * * * /path/to/command arg1 arg2

* * * * * /root/backup.txt

| Field | Increment | Range |
| --- | --- | --- |
| first * | minutes | 0-59 |
| second * | hours | 0-23 |
| third * | day of month | 1-31 |
| fourth * | month | 1-12 |
| fifth * | day of week | 0-7 (0 and 7 = Sunday) |

Examples

Example 1:

Task: Run backup script on first day of each month at 21h.

0 21 1 * * /path/to/script/backup-script.sh

Example 2:

Task: Run php script every day at 10 AM.

0 10 * * * /path/to/myphpscript.php

Example 3:

Task: Run every hour at minute 0.

0 * * * * /path/to/myphpscript.php

More on Crontab.guru

Redirecting input/output/error IO

Keyboard input (stdin, channel 0)

Screen print (stdout, channel 1)

Create/overwrite a text file with (string) output contents using >

echo "Hello World!" > text.txt

Create/append a text file with (string) output contents using >>

echo "Hello to you too!" >> text.txt

Error (stderr, channel 2)

Create/overwrite a text containing error message using 2>

find /usr games 2> text-error.txt

If no error occurs, nothing will be saved to text file.

Create/append a text containing error message using 2>>

sort /etc 2>> text-error.txt

Redirecting to ‘/dev/null’ bit bucket (suppress output)

sort /etc 2> /dev/null

Substituting keyboard input with file contents

Input data to a command, from file instead of a keyboard. Used with commands that don’t accept file arguments.

cat < text.txt

Translate delete letter l in text

tr -d "l" < text.txt

Combo redirects (channel 1 screen & channel 2 error)

find /usr admin &> newfile.txt

Command chains

Sequential runs in a command line with ;

; links commands together and runs sequentially. Inputs and outputs are separate.

Piping | redirect output to next command as input

Redirect output to next input.

Show contents of passwd file -> make page scrollable

cat /etc/passwd | less

Show all files -> only top 10 -> word count only words

ls -l | head | wc -w

Redirect output only if successful

&& links commands together, but only runs the 2nd command if the 1st succeeds.

csvlook SpotifyData_All.csv && csvstat SpotifyData_All.csv

Grep - searching within files

grep “[regex-term]” [file]

grep “exact-string” /etc/passwd

-i case-insensitive search

-r recursive in subfolder files

-c counts matches

-v invert, show what NOT matches

-E extended regex | + ?

Regex for grep and more

| Expression | Description |
| --- | --- |
| . | any char except newline |
| [abcABC] | match any char in brackets |
| [^abcABC] | match any char except the ones in brackets |
| [a-z] | match any char in range |
| [^a-z] | match any char except in range |
| sun\|moon | or |
| ^ | start of line |
| $ | end of line |
| * | zero or more of the preceding pattern |
| + | one or more of the preceding pattern (requires -E) |
| ? | zero or one of the preceding pattern (requires -E) |

Examples

1 All lines containing the word cat anywhere on the line.

grep "cat" /usr/share/hunspell/en_US.dic

2 All lines that do not contain (inverse) any of the following characters: sawgtfixk.

grep -v "[sawgtfixk]" /usr/share/hunspell/en_US.dic

3 All lines that start with any 3 letters and the word dig.

grep "^...dig" /usr/share/hunspell/en_US.dic

4 All lines that end with at least one e.

grep -E "e+$" /usr/share/hunspell/en_US.dic

5 All lines that contain one of the following words: org , kay or tuna.

grep -E "org|kay|tuna" /usr/share/hunspell/en_US.dic

6 Number count of lines that start with one or no c followed by the string ati.

grep -cE "^c?ati" /usr/share/hunspell/en_US.dic

Writing / executing bash scripts

All scripts should begin with a shebang, which defines the path to the interpreter (#!/bin/bash).

All scripts should include comments to describe their use (#…).

View scripts with cat -n [script.sh] (-n = line numbers)

  • No spaces before/after = assigning variables.

  • Call variable with $ before variable ($username).

  • $# env. variable with count of arguments passed.

  • $? env. variable with error codes (0=no error).

  • $@ or $* env. variable containing ALL passed arguments.

  • $1 - $9 = explicit call of positional parameters 1-9

  • Single quotes (‘ ‘). Literal interpretation (variable substitution won’t work).

  • Double quotes (“ “). Variable substitution works. Example: “$var” -> varcontent.

  • Back ticks (` `). Content gets fetched from a sub-shell.

  • Parenthesis method $(). Content gets fetched from a sub-shell. Example: var=“Today’s date is $(date).”

  • Variables are always considered type=str -> math doesn’t work out of the box.

echo 'echo "Hello World!"' > new_script.sh

Specify the interpreter in the first line of the script with a shebang comment (path to the bash binary):

echo -e '#!/bin/bash\necho "Hello World!"' > new_script.sh

Add execute permission to the script file, then run it:

chmod +x [new_script.sh]

./[new_script.sh]

Examples

Single quotes vs double quotes

#!/bin/bash

username="Carol Smith"
echo "Hello $username!"
echo 'Hello $username!'

Hello Carol Smith! Hello $username!

Script Arguments

Input: new_script.sh Carol Dave

#!/bin/bash

username1=$1
username2=$2
echo "Hello $username1 and $username2!"

Hello Carol and Dave!

Number of Script Arguments ($#)

Input: new_script.sh Carol Dave

#!/bin/bash

username=$1
echo "Hello $username!"
echo "Number of arguments: $#."

Hello Carol! Number of arguments: 2.

Shell-within-a-shell calculation

#!/bin/bash

model1=87.56
model2=89.20
echo "Total score is $(echo "$model1 + $model2" | bc)"
echo "Average score is $(echo "scale=2; ($model1 + $model2) / 2" | bc)"

Converting temperature units (pass °F as parameter)

Input: tempconverter.sh 84

#!/bin/bash

temp_f=$1
temp_f2=$(echo "scale=2; $temp_f - 32" | bc)
temp_c=$(echo "scale=2; $temp_f2 * 5 / 9" | bc)
echo $temp_c

28.88

Script creating variables from files

Input: pull_temp_files.sh

#!/bin/bash

temp_a=$(cat temps/region_A)
temp_b=$(cat temps/region_B)
temp_c=$(cat temps/region_C)
echo "The three temperatures were $temp_a, $temp_b, and $temp_c"

The three temperatures were 45, 22, and 32

If-conditions

Operators

Scalar comparison operators

-eq equal (=)

-ne not equal (!=)

-gt greater than (>)

-ge greater or equal (>=)

-lt less than (<)

-le less or equal(<=)

String Comparison Operators

abc == abc true

abc == ABC false

1 == 1 true

1+1 == 2 false (the strings are compared literally, not evaluated arithmetically)
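A minimal sketch of a string comparison inside an if-test (the value compared against is arbitrary):

#!/bin/bash

if [ "$1" == "admin" ]
then
	echo "Welcome back!"
else
	echo "Unknown user."
fi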

Examples

Example 1

If one argument is passed [ $# -eq 1 ], then echo greeting, else raise error, finish if, print $#

Input: new_script.sh Carol

#!/bin/bash

if [ $# -eq 1 ]
then
	username=$1
	echo "Hello $username!"
else
	echo "Please enter only one argument."
fi
echo "Number of arguments: $#."

Hello Carol! Number of arguments: 1.

Example 2

Input: guided1.sh 3 0

#!/bin/bash

fruit1=Apples
fruit2=Oranges

if [ $1 -lt $# ]
then
	echo "This is like comparing $fruit1 and $fruit2!"
elif [ $1 -gt $2 ]
then
	echo "$fruit1 win!"
else
	echo "$fruit2 win!"
fi

Apples win!

Exit codes

Make your own exit codes and end script there

Input: script.sh Carol

#!/bin/bash

# A simple script to greet a single user.
if [ $# -eq 1 ]
then
	username=$1
	echo "Hello $username!"
	exit 0
else
	echo "Please enter only one argument."
	exit 1
fi
echo "Number of arguments: $#."  #never executed, since exited before

Output:

Hello Carol!

echo $? check exit code

0

Case-conditions

Instead of long chains of if/else statements, case statements are simpler to write.

Feed in filename as script argument ($1)

#!/bin/bash

case $1 in
	# Match on all months with OR pipes
	January|March|May|July|September|November)
	echo "It is an odd month!";;
  
	# Match on all even months
	February|April|June|August|October|December)
	echo "It is an even month!";;
  
	# Default case if no match (i.e., else)
	*)
	echo "Not a month!";;
esac

Feed in a filename as script argument ($1), cat the file contents, and move or remove the file depending on the match.

#!/bin/bash

case $(cat $1) in
	*Frankfurt*)
	mv $1 frankfurt/ ;;

	*Berlin*)
	rm $1 ;;

	*Hamburg*)
	mv $1 "SPECIAL_$1" ;;
	
	# Default case if no match
	*)
	echo "No city found" ;;
esac

For-loops

Pattern: `for [indexing_var] in [iterable]; do [something]; done`

Examples

Example 1

#!/bin/bash

FILES="/usr/sbin/accept /usr/sbin/pwck /usr/sbin/chroot"

for file in $FILES
do
	ls -lh $file
done

Example 2

Loop through parameters and print each parameter

Input: friendly2.sh Carol Dave Henry

#!/bin/bash

# if there are no parameters passed
if [ $# -eq 0 ]		
then
	echo "Please enter at least one user to greet."
	exit 1
else
	for username in $@
	do
		echo "Hello $username!"
	done
	exit 0
fi

Output:

Hello Carol! Hello Dave! Hello Henry!

Example 3

Loop through parameters and check for alphabetical symbols

Input: friendly2.sh Carol Dave Henry

#!/bin/bash

if [ $# -eq 0 ]
then
	echo "Please enter at least one user to greet."
	exit 1
else
	for username in $@
	do
		echo $username | grep "^[A-Za-z]*$" > /dev/null
		# if  the line above produced an error code
		if [ $? -eq 1 ]  
		then
			echo "ERROR: Names must only contain letters."
			exit 2
		else
			echo "Hello $username!"
		fi
	done
	exit 0
fi

Output:

Hello Carol! Hello Dave! Hello Henry!

friendly2.sh 42 Carol Dave Henry

ERROR: Names must only contain letters. echo $? shows the exit code: 2

Example 4

A script that will take any number of arguments from the user, and print only those arguments which are numbers greater than 10.

#!/bin/bash

for arg in $@
do
	# check that the argument consists only of digits
	echo $arg | grep "^[0-9]*$" > /dev/null
	
	# if the above produced NO error code
	if [ $? -eq 0 ]								
	then
		if [ $arg -gt 10 ]
		then
			echo $arg
		fi
	fi
done

Example 5

A script combining for-loop and case-statements to sort files into folders based on contents or remove them.

#!/bin/bash

# loop through files in folder
for file in model_output/*
do
    # cat into file's content
    case $(cat $file) in

		# Matched tree models are moved
		*"Random Forest"*|*GBM*|*XGBoost*)
		mv $file tree_model/ ;;
		
		# Other Models are deleted/removed
		*KNN*|*Logistic*)
		rm $file ;;

		# Default case if no match (i.e., else)
		*) 
		echo "Unknown model in $file" ;;
    esac
done

Functions

A function with no parameters gets set up with the keyword function and curly braces.

Unlike in many other programming languages, variables defined within a function are in global scope by default. To restrict a variable to the function, it has to be declared with the local keyword.

The return keyword is not for returning data, but for status codes (0 = no error, 1-255 = failure). To get data out of a function, echo the value (and capture it in the caller) and/or use a global variable.
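A minimal sketch of the return-for-status pattern (the function and file name are made up): the caller checks $? instead of capturing output.

function file_exists () {
	if [ -f "$1" ]
	then
		return 0
	else
		return 1
	fi
}

file_exists "settings.conf"
echo "Exit status: $?"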

Example 1

function cloud_load () {
	# loop through all files
	for file in output_dir/*results*
	do
		# do something
		echo "Uploading $file to cloud"
	done
}

# call
cloud_load

Example 2

However, it also works without the keyword.

what_day_is_it () {

	# Parse date
	local 	current_day=$(date | cut -d " " -f1)

	echo $current_day
}

# call
what_day_is_it

Returning Data and Arguments

Arguments for functions are handled with $1, $2, ..., $@, and $# just as with shell scripts. They do not get declared when writing the function; they all arrive as positional parameters.

function return_percentage () {

  # Do calculation on passed parameters
  percent=$(echo "scale=2; 100 * $1 / $2" | bc)

  # print the data
  echo $percent
}

# Call function
return_test=$(return_percentage 410 55)
echo "The percentage is $return_test%"

The percentage is 745.45%

csvkit data processing

csvkit can do csv processing similar to Pandas data frames.

Install using Python pip

pip install csvkit

in2csv converting files to csv

Print excel spreadsheet

in2csv My_data.xlsx

Print excel sheet names with flag -n

in2csv -n My_Data.xlsx

Excel to csv

in2csv My_Data.xlsx > My_Data.csv

Excel sheet to csv

in2csv My_Data.xlsx --sheet "Worksheet1" > My_Data_worksheet1.csv

csvlook csv preview in shell

Print a csv like a markdown table.

csvlook My_data.csv
| id                     | size |
| ---------------------- | ---- |
| 118GQ70Sp6pMqn6w1oKuki | 7    |
| 6S7cr72a7a8RVAXzDCRj6m | 7    |
| 7h2qWpMJzIVtiP30E8VDW4 | 7    |
| 3KVQFxJ5CWOcbxdpPYdi4o | 7    |
| 0JjNrI1xmsTfhaiU1R6OVc | 7    |
| 3HjTcZt29JUHg5m60QhlMw | 7    |

csvstat descriptive statistics

Similar to the pandas command pandas.DataFrame.describe

csvstat My_data.csv
  1. "id"
     Type of data:          Text
     Contains null values:  False
     Unique values:         24
     Longest value:         22 characters
     Most common values:    118GQ70Sp6pMqn6w1oKuki (1x)
                            6S7cr72a7a8RVAXzDCRj6m (1x)

csvcut column cutting

Print column names with flag -n

csvcut -n My_Data.csv

Print first, third, fifth columns by positional argument

csvcut -c 1,3,5 My_Data.csv

Print columns by name

csvcut -c "id","size" My_Data.csv

csvgrep search in csv

-m exact row value to filter

-c select column

csvgrep -c "id" -m 5RCPsfzmEpTXMCTNk7wEfQ My_Data.csv

-r regex pattern

-f path to a file/filename
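A short example with the -r flag (the column name and pattern are placeholders): keep only rows whose id starts with a 5.

csvgrep -c "id" -r "^5" My_Data.csv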

csvstack vertical stacking

csvstack My_data1.csv My_data2.csv > My_data_complete.csv

csvlook My_data_complete.csv
| id                     | size |
| ---------------------- | ---- |
| 7JYCpIzpoidDOnnmxmHwtj | 6    |
| 0mmFibEg5NuULMwTVN2tRU | 6    |
| 118GQ70Sp6pMqn6w1oKuki | 7    |
| 6S7cr72a7a8RVAXzDCRj6m | 7    |
Add a group column that traces the original file with flag -g and a group name with flag -n

csvstack -g "data1","data2" -n "source" My_data1.csv My_data2.csv > My_data_complete.csv

csvlook My_data_complete.csv
| source | id                     | size |
| ------ | ---------------------- | ---- |
| data1  | 7JYCpIzpoidDOnnmxmHwtj | 6    |
| data1  | 0mmFibEg5NuULMwTVN2tRU | 6    |
| data2  | 118GQ70Sp6pMqn6w1oKuki | 7    |
| data2  | 6S7cr72a7a8RVAXzDCRj6m | 7    |

sql2csv load from sql database

Executes an SQL query on a large variety of SQL databases (e.g., MS SQL, MySQL, Oracle, PostgreSQL, Sqlite) and outputs result to csv.

Establishing database connection

--db is followed by the database connection string

SQLite: starts with sqlite:/// and ends with .db

Postgres: starts with postgres:/// and no .db

MySQL: starts with mysql:/// and no .db

Querying against the database

--query is followed by the SQL query string

Use SQL syntax compatible with the database

Write query in one line with no line breaks

Saving the output

> re-directs output to new local csv

sql2csv --db "sqlite:///TwitterDatabase.db" \
        --query "SELECT * FROM Twitter_Size" \
        > Twitter_size.csv
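A hedged Postgres variant following the same pattern (database and table names are made up; depending on the SQLAlchemy version the scheme may need to be postgresql:/// instead):

# placeholder names; connection string built per the note above
sql2csv --db "postgres:///analytics_db" \
        --query "SELECT * FROM twitter_size" \
        > twitter_size.csv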

csvsql

Write SQL queries against csv

Applies SQL statements to one or more csv files. It creates an in-memory SQL database that temporarily hosts the file, so it is suitable for small to medium files only.

--query contains the sql statement

csvsql --query "SELECT * FROM Twitter_data LIMIT 1" \
       data/Twitter_data.csv | csvlook

Query joined tables/csv files

The sql query must be written in one line without breaks.

The order of files in the query and the csv files after that must match.

csvsql --query "SELECT * FROM file_a INNER JOIN file_b" file_a.csv file_b.csv

Upload csv to database

Create tables and insert into tables.

--insert csv file to insert into a table

--db database address

--no-inference Disable type inference when parsing the input

--no-constraints Generate a schema without length limits or null checks

csvsql --no-inference --no-constraints \
       --db "sqlite:///TwitterDatabase.db" \
       --insert Twitter_data.csv

Speedy Python

Speedy Python

Start measuring, find bottlenecks.


  1. Measurement Tools
    1. Timeit
    2. Line Profiler
    3. Memory Profiler
    4. Timer Decorator
  2. Examples, slow and fast.
    1. Combining items of two lists (loop vs zip)
    2. Counting items on a list (loop vs Counter function)
    3. Number of combinations on a list (nested loop vs Combinations function)
    4. Replace loops with built-in functions (zip, map)
  3. My function runs for too long…

Writing code in a programming language is very flexible: many solutions will lead you to the goal. But not all of them are equally quick in computation and parsimonious in memory allocation. For many applications you won't really bother with using speedier commands. However, it becomes a consideration when dealing with large datasets (aka Big Data) or when you use serverless services that charge by computation time (e.g., AWS Lambda). How do we find out which command is more efficient and which is less? You can measure!

Measurement Tools

  • Timeit: line magic (%timeit) and cell magic (%%timeit). The line magic times only a single line, the cell magic times the entire cell. The benefit of timeit is that it repeats the command automatically and delivers a more reliable average.

  • Line_Profiler, a module that accepts a function and returns a line-by-line analysis of how long each line took to execute and its proportion of the overall time. A great way to find bottleneck commands in a function.

  • Memory_Profiler, a module that accepts a function (imported from a .py file) and delivers a line-by-line analysis of memory usage in MiB.

If you want to know more about magic commands and code measurement, check out these articles:

An example Jupyter Notebook with all the code of this post can be found here

Timeit

Line Magic

Line magic is used for short one-liner measuring.

The %timeit magic is built into IPython/Jupyter, so no extension has to be loaded first.

Default parameters:

%timeit lambda: "-".join(map(str, range(10000)))

44 ns ± 1.45 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Setting flags to set custom parameters for repetitions and number of runs:

%timeit -r 10 -n 1000 lambda: "-".join(map(str, range(10000)))

40.9 ns ± 0.134 ns per loop (mean ± std. dev. of 10 runs, 1000 loops each)

Cell Magic

Cell Magic is used to measure multi-line scripts. The same flags can be set for cell magic.

%%timeit
total = 0
for i in range(100):
    for j in range(100):
        total += i * (-1) ** j

3.33 ms ± 71.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Line Profiler

The Line Profiler reports the total run time and, in addition, detailed measurements of how much time each line took to execute, how many times it was executed, and what proportion of the total each line makes up.

%load_ext line_profiler

%lprun -f convert_units_list convert_units_list(nameList, HTList, WTList)
Timer unit: 1e-07 s

Total time: 0.213249 s
File: <ipython-input-7-b143edde93f5>
Function: convert_units_list at line 2

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     2                                           def convert_units_list(names, heights, weights):
     3         1     184181.0 184181.0      8.6      new_hts = [ht * 0.39370  for ht in heights]
     4         1     171217.0 171217.0      8.0      new_wts = [wt * 2.20462  for wt in weights]
     5         1         21.0     21.0      0.0      people_data = {}
     6    100001     883878.0      8.8     41.4      for i,name in enumerate(names):
     7    100000     893188.0      8.9     41.9          people_data[name] = (new_hts[i], new_wts[i])
     8         1          7.0      7.0      0.0      return people_data

Memory Profiler

Use and settings of the Memory Profiler are very similar to the Line Profiler. The "Increment" column shows the memory volume gained in each line. The Memory Profiler only works with functions that are imported from .py files in the working directory! If a function is too small to measure, it will show "0.0 MiB".
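For reference, the profiled function lives in a .py file in the working directory; a sketch of conv_list.py, reconstructed from the profiler listings shown here:

# conv_list.py (reconstructed from the profiler output; a sketch)
def convert_units_list(names, heights, weights):
    new_hts = [ht * 0.39370 for ht in heights]    # cm -> inches
    new_wts = [wt * 2.20462 for wt in weights]    # kg -> pounds
    people_data = {}
    for i, name in enumerate(names):
        people_data[name] = (new_hts[i], new_wts[i])
    return people_data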

from conv_list import convert_units_list

%load_ext memory_profiler

%mprun -f convert_units_list convert_units_list(nameList, HTList, WTList)
Filename: C:\Users\ChristianV700\Documents\GitHub\Python_coding\speedy_python\conv_list.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     1     76.8 MiB     76.8 MiB           1   def convert_units_list(names, heights, weights):
     2     81.1 MiB      4.3 MiB      100003       new_hts = [ht * 0.39370  for ht in heights]
     3     85.7 MiB      4.6 MiB      100003       new_wts = [wt * 2.20462  for wt in weights]
     4     85.7 MiB      0.0 MiB           1       people_data = {}
     5     96.9 MiB      6.1 MiB      100001       for i,name in enumerate(names):
     6     96.9 MiB      5.0 MiB      100000           people_data[name] = (new_hts[i], new_wts[i])
     7     96.9 MiB      0.0 MiB           1       return people_data

Timer Decorator

The decorator automates the timing of a function so we do not have to repeat the timing logic over and over.

import time
from functools import wraps

def timer(func):
	'''A decorator that prints how long a function took to run.

		Args:
			func (callable): The function being decorated.
	
		Returns:
			callable: The decorated function.
	'''
	
	@wraps(func)
	def wrapper(*args, **kwargs):
		t_start = time.time()
		
		# things are happening here
		result = func(*args, **kwargs)
		
		duration = time.time() - t_start
		print('{} took {} seconds'.format(func.__name__, duration))
		
		return result	
	
	return wrapper
	
# Make use of the decorator with a function we are evaluating (modify as needed)
@timer
def slow_function(n):
	'''Pause processing for n seconds.
	
		Args: 
			n (int): the number of seconds to pause for.
	'''
	# things are happening here
	time.sleep(n)

slow_function(10)

slow_function took 10 seconds

Examples, slow and fast.

Combining items of two lists (loop vs zip)

%%timeit
combined = []
for i, name in enumerate(nameList):
    combined.append((name, HTList[i]))

17.8 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
combined_zip = zip(nameList, HTList)

194 ns ± 1.06 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Summary: The zip function is multiple magnitudes faster than the loop.

Counting items on a list (loop vs Counter function)

The Counter function is part of the 'collections' module and is imported with from collections import Counter.

%%timeit
height_counts = {}
for height in HTList:
    if height not in height_counts:
        height_counts[height] = 1
    else:
        height_counts[height] += 1

12.4 ms ± 382 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
height_counts = Counter(HTList)

4.96 ms ± 26.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Summary: Counter takes only about a third of the loop.

Number of combinations on a list (nested loop vs Combinations function)

%%timeit
pairs = []
for x in nameList[:100]:
    for y in nameList[:100]:
        if x == y:
            continue
        if ((x,y) not in pairs) & ((y,x) not in pairs):
            pairs.append((x,y))

1.98 s ± 23.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
# combinations is imported beforehand: from itertools import combinations
combinations(nameList[:100],2)

724 ns ± 6.85 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Summary: The difference here is the most staggering: almost two seconds for the nested loops and well under a millisecond for the function. Note that combinations() returns a lazy iterator, so part of the work is deferred until the combinations are actually consumed.

Replace loops with built-in functions (zip, map)

%%timeit
loop_output = []
for name,weight in zip(nameList, WTList):
    if weight < 100:
        name_length = len(name)
        tuple = (name, name_length)
        loop_output.append(tuple)

11.7 ms ± 128 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
filtered_name_list = [name for name,weight in zip(nameList, WTList) if weight < 100]
name_lengths_map = map(len, filtered_name_list)
output = [*zip(filtered_name_list, name_lengths_map)]

8.95 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

My function runs for too long…

What if you are having trouble with a function executing for longer than expected?

You can have a timeout alarm set if it runs longer than a set duration of seconds.

The example below makes use of a decorator factory that produces decorators with a configurable timeout duration (in seconds).

from functools import wraps
import signal
import time

# Handler that raises a TimeoutError when the alarm signal (SIGALRM) fires
def raise_timeout(signum, frame):
	raise TimeoutError()

# Register the handler, so the alarm raises an exception instead of terminating the process
signal.signal(signal.SIGALRM, raise_timeout)

def timeout(n_seconds):
	def decorator(func):
		@wraps(func)
		def wrapper(*args, **kwargs):
			# set alarm timer
			signal.alarm(n_seconds)
			try:
				# run the desired function
				return func(*args, **kwargs)
			finally:
				# cancel the alarm in any case
				signal.alarm(0)
		return wrapper
	return decorator

# Decorate a sleep example that will trigger the timeout
@timeout(5)
def trialfunction():
	time.sleep(10)

trialfunction()

TimeoutError

Equinor Volve Log ML

Equinor Volve Log ML

Work with real Operator Data!


  1. Background
  2. Data
  3. Problem
  4. Goal
  5. Processing
    1. Imports
    2. Load data
    3. Data Preparation
    4. Exploratory Data Analysis
      1. Pair-plot of the Train Data
      2. Spearman’s Correlation Heatmap
    5. Transformation of the Train Data
      1. Pair-Plot (after transformation)
    6. Removing Outliers
      1. Method 1: Standard Deviation
      2. Method 2: Isolation Forest
      3. Method 3: Minimum Covariance Determinant
      4. Method 4: Local Outlier Factor
      5. Method 5: Support Vector Machine
    7. Train and Validate
      1. Fit to Test and Score on Val
      2. Inverse Transformation of Prediction
    8. Hyperparameter Tuning
    9. Predict Test Wells
      1. Define the Test Data
      2. Transform Test - Predict - Inverse Transform
      3. Plot the Predictions

This repository is my exploration on bringing machine learning to the Equinor “Volve” geophysical/geological dataset, which was opened up to the public in 2018.

For more information about this open data set, its publishing licence, and how to obtain the original version please visit Equinor’s Data Village.

The full dataset contains ~40,000 files and the public is invited to learn from it. It is on my to-do list to explore more data from this treasure trove. In fact, the data has not been explored very much in the short time since Volve became open data and likely still holds many secrets. The good thing is that certain data workflows can be re-used from log to log, well to well, etc. So see this as a skeleton blueprint with ample space for adaptation to your own goals. Some people may even create a bespoke object-oriented workflow to automate the processing and save time on repetitive work with a data set this large.

The analysis in this post was inspired in part by the geophysicist Yohanes Nuwara (see here). He also wrote a TDS article (see here) on the topic that is worth your time if you want to dive in deeper!

More on the topic:

The full repository including Jupyter Notebook, data, and results of what you see below can be found here.

Background

The Volve field was operated by Equinor in the Norwegian North Sea between 2008 and 2016. Located 200 kilometres west of Stavanger (Norway) at the southern end of the Norwegian sector, the field was decommissioned in September 2016 after 8.5 years in operation. Equinor operated it more than twice as long as originally planned. The development was based on production from the Mærsk Inspirer jack-up rig, with Navion Saga used as a storage ship to hold crude oil before export. Gas was piped to the Sleipner A platform for final processing and export. The Volve field reached a recovery rate of 54% when, in March 2016, it was decided to shut down production permanently. Reference.

Data

Wireline-log files (.LAS) of five wells:

  • Log 1: 15_9-F-11A
  • Log 2: 15_9-F-11B
  • Log 3: 15_9-F-1A
  • Log 4: 15_9-F-1B
  • Log 5: 15_9-F-1C

The .LAS files contain the following feature columns:

| Name  | Unit             | Description                                               | Read More |
| ----- | ---------------- | --------------------------------------------------------- | --------- |
| Depth | [m]              | Below Surface                                             |           |
| NPHI  | [vol/vol]        | Neutron Porosity (not calibrated in basic physical units) | Reference |
| RHOB  | [g/cm3]          | Bulk Density                                              | Reference |
| GR    | [API]            | Gamma Ray radioactive decay (aka shalyness log)           | Reference |
| RT    | [ohm*m]          | True Resistivity                                          | Reference |
| PEF   | [barns/electron] | PhotoElectric absorption Factor                           | Reference |
| CALI  | [inches]         | Caliper, Borehole Diameter                                | Reference |
| DT    | [μs/ft]          | Delta Time, Sonic Log, P-wave, interval transit time      | Reference |

Problem

Wells 15/9-F-11B (log 2) and 15/9-F-1C (log 5) lack the DT Sonic Log feature.

Goal

Predict Sonic Log (DT) feature in these two wells.

Processing

Imports

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import itertools

import lasio

import glob
import os
import md_toc

Load data

LAS is an ASCII file type for borehole logs. It contains:

  1. the header with detailed information about a borehole and column descriptions and
  2. the main body with the actual data.

The package lasio helps with parsing and writing such files in Python. Reference: https://pypi.org/project/lasio/

# Find paths to the log files (MS windows path style)
paths = sorted(glob.glob(os.path.join(os.getcwd(),"well_logs", "*.LAS")))

# Create a list for loop processing
log_list = [0] * len(paths)

# Parse LAS with LASIO to create pandas df
for i in range(len(paths)):
  df = lasio.read(paths[i])
  log_list[i] = df.df()
  # this transforms the depth from index to regular column
  log_list[i].reset_index(inplace=True)

log_list[0].head()
(head of log 1: columns DEPTH, ABDCQF01, ABDCQF02, ABDCQF03, ABDCQF04, BS, CALI, DRHO, DT, DTS, PEF, RACEHM, RACELM, RD, RHOB, RM, ROP, RPCEHM, RPCELM, RT; the first rows at 188.5-188.9 m are all NaN except BS = 36.0)
# Save logs from list of dfs into separate variables
log1, log2, log3, log4, log5 = log_list

# Helper function for repeated plotting

def makeplot(df,suptitle_str="pass a suptitle"):

  # Lists of used columns and colors
  columns = ['NPHI', 'RHOB', 'GR', 'RT', 'PEF', 'CALI', 'DT']
  colors = ['red', 'darkblue', 'black', 'green', 'purple', 'brown', 'turquoise']

  # determine how many columns are available in the log (some miss 'DT')
  col_counter = 0
  for i in df.columns:
    if i in columns:
      col_counter+=1

  # Create the subplots
  fig, ax = plt.subplots(nrows=1, ncols=col_counter, figsize=(col_counter*2,10))
  fig.suptitle(suptitle_str, size=20, y=1.05)

  # Looping each log to display in the subplots
  for i in range(col_counter):
    if i == 3:
      # semilog plot for resistivity ('RT')
      ax[i].semilogx(df[columns[i]], df['DEPTH'], color=colors[i])
    else:
      # all other -> normal plot
      ax[i].plot(df[columns[i]], df['DEPTH'], color=colors[i])
  
    ax[i].set_title(columns[i])
    ax[i].grid(True)
    ax[i].invert_yaxis()

  plt.tight_layout() #avoids label overlap
  plt.show()

makeplot(log1,"Log 1 15_9-F-11A")


log 1 exploration

makeplot(log2, "Log 2 15_9-F-11B")


log 2 exploration

Data Preparation

  1. The train test split is easy, the data is already partitioned by wells:
    • Train on logs 1, 3, and 4.
    • Test (validation) on logs 2 and 5.
  2. There are many NaN values in the logs. The plots above only display the non-NaN samples and can therefore be used to gauge where the logs need to be clipped. The NaN values sit primarily at the top and bottom of each log, before and after the actual readings. The logs get clipped around the following depths:
    • log1 2,600 - 3,720 m
    • log2 3,200 - 4,740 m
    • log3 2,620 - 3,640 m
    • log4 3,100 - 3,400 m
    • log5 3,100 - 4,050 m
  3. Furthermore, the logs contain many more features than we need. The relevant columns get selected for further use; the rest gets discarded.
# Lists of depths for clipping
lower = [2600, 3200, 2620, 3100, 3100]
upper = [3720, 4740, 3640, 3400, 4050]

# Lists of selected columns
train_cols = ['DEPTH', 'NPHI', 'RHOB', 'GR', 'RT', 'PEF', 'CALI', 'DT']
test_cols = ['DEPTH', 'NPHI', 'RHOB', 'GR', 'RT', 'PEF', 'CALI']

log_list_clipped = [0] * len(paths)

for i in range(len(log_list)):
    
  # Clip depths
  temp_df = log_list[i].loc[
      (log_list[i]['DEPTH'] >= lower[i]) & 
      (log_list[i]['DEPTH'] <= upper[i])
  ]

  # Select train-log columns
  if i in [0,2,3]:
    log_list_clipped[i] = temp_df[train_cols]
  
  # Select test-log columns
  else:
    log_list_clipped[i] = temp_df[test_cols]

# Save logs from list into separate variables
log1, log2, log3, log4, log5 = log_list_clipped
# check for NaN
log1
|       | DEPTH  | NPHI  | RHOB  | GR     | RT    | PEF   | CALI  | DT      |
| ----- | ------ | ----- | ----- | ------ | ----- | ----- | ----- | ------- |
| 24115 | 2600.0 | 0.371 | 2.356 | 82.748 | 1.323 | 7.126 | 8.648 | 104.605 |
| 24116 | 2600.1 | 0.341 | 2.338 | 79.399 | 1.196 | 6.654 | 8.578 | 103.827 |
| 24117 | 2600.2 | 0.308 | 2.315 | 74.248 | 1.171 | 6.105 | 8.578 | 102.740 |
| 24118 | 2600.3 | 0.283 | 2.291 | 68.542 | 1.142 | 5.613 | 8.547 | 100.943 |
| 24119 | 2600.4 | 0.272 | 2.269 | 60.314 | 1.107 | 5.281 | 8.523 | 98.473  |
| 35311 | 3719.6 | 0.236 | 2.617 | 70.191 | 1.627 | 7.438 | 8.703 | 84.800  |
| 35312 | 3719.7 | 0.238 | 2.595 | 75.393 | 1.513 | 7.258 | 8.750 | 85.013  |
| 35313 | 3719.8 | 0.236 | 2.571 | 82.648 | 1.420 | 7.076 | 8.766 | 85.054  |
| 35314 | 3719.9 | 0.217 | 2.544 | 89.157 | 1.349 | 6.956 | 8.781 | 84.928  |
| 35315 | 3720.0 | 0.226 | 2.520 | 90.898 | 1.301 | 6.920 | 8.781 | 84.784  |

Next Steps:

  • Concatenate the training logs into a training df and the test logs into a test df

  • Assign log/well names to each sample.

  • Move the DEPTH column to the end (right) of the dataframes.

# Concatenate dataframes
train = pd.concat([log1, log3, log4])
pred = pd.concat([log2, log5])

# Assign names
names = ['15_9-F-11A', '15_9-F-11B', '15_9-F-1A', '15_9-F-1B', '15_9-F-1C']

names_train = []
names_pred = []

for i in range(len(log_list_clipped)):
  if i in [0,2,3]:
    # Train data, assign names 
    names_train.append(np.full(len(log_list_clipped[i]), names[i]))
  else:
    # Test data, assign names
    names_pred.append(np.full(len(log_list_clipped[i]), names[i]))

# Concatenate inside list
names_train = list(itertools.chain.from_iterable(names_train))
names_pred = list(itertools.chain.from_iterable(names_pred))

# Add well name to df
train['WELL'] = names_train
pred['WELL'] = names_pred

# Pop and add depth to end of df
depth_train, depth_pred = train.pop('DEPTH'), pred.pop('DEPTH')
train['DEPTH'], pred['DEPTH'] = depth_train, depth_pred

# Train dataframe with logs 1,3,4 vertically stacked
train
|       | NPHI   | RHOB   | GR      | RT     | PEF    | CALI   | DT       | WELL       | DEPTH  |
| ----- | ------ | ------ | ------- | ------ | ------ | ------ | -------- | ---------- | ------ |
| 24115 | 0.3710 | 2.3560 | 82.7480 | 1.3230 | 7.1260 | 8.6480 | 104.6050 | 15_9-F-11A | 2600.0 |
| 24116 | 0.3410 | 2.3380 | 79.3990 | 1.1960 | 6.6540 | 8.5780 | 103.8270 | 15_9-F-11A | 2600.1 |
| 24117 | 0.3080 | 2.3150 | 74.2480 | 1.1710 | 6.1050 | 8.5780 | 102.7400 | 15_9-F-11A | 2600.2 |
| 24118 | 0.2830 | 2.2910 | 68.5420 | 1.1420 | 5.6130 | 8.5470 | 100.9430 | 15_9-F-11A | 2600.3 |
| 24119 | 0.2720 | 2.2690 | 60.3140 | 1.1070 | 5.2810 | 8.5230 | 98.4730  | 15_9-F-11A | 2600.4 |
| 32537 | 0.1861 | 2.4571 | 60.4392 | 1.2337 | 5.9894 | 8.7227 | 75.3947  | 15_9-F-1B  | 3399.6 |
| 32538 | 0.1840 | 2.4596 | 61.8452 | 1.2452 | 6.0960 | 8.6976 | 75.3404  | 15_9-F-1B  | 3399.7 |
| 32539 | 0.1798 | 2.4637 | 61.1386 | 1.2960 | 6.1628 | 8.6976 | 75.3298  | 15_9-F-1B  | 3399.8 |
| 32540 | 0.1780 | 2.4714 | 59.3751 | 1.4060 | 6.1520 | 8.6976 | 75.3541  | 15_9-F-1B  | 3399.9 |
| 32541 | 0.1760 | 2.4809 | 58.3742 | 1.4529 | 6.1061 | 8.6978 | 75.4476  | 15_9-F-1B  | 3400.0 |
# Pred dataframe with logs 2, 5 vertically stacked
pred
|       | NPHI   | RHOB   | GR       | RT     | PEF    | CALI   | WELL       | DEPTH  |
| ----- | ------ | ------ | -------- | ------ | ------ | ------ | ---------- | ------ |
| 30115 | 0.0750 | 2.6050 | 9.3480   | 8.3310 | 7.4510 | 8.5470 | 15_9-F-11B | 3200.0 |
| 30116 | 0.0770 | 2.6020 | 9.3620   | 8.2890 | 7.4640 | 8.5470 | 15_9-F-11B | 3200.1 |
| 30117 | 0.0780 | 2.5990 | 9.5450   | 8.2470 | 7.4050 | 8.5470 | 15_9-F-11B | 3200.2 |
| 30118 | 0.0790 | 2.5940 | 11.1530  | 8.2060 | 7.2920 | 8.5470 | 15_9-F-11B | 3200.3 |
| 30119 | 0.0780 | 2.5890 | 12.5920  | 8.1650 | 7.1670 | 8.5470 | 15_9-F-11B | 3200.4 |
| 39037 | 0.3107 | 2.4184 | 106.7613 | 2.6950 | 6.2332 | 8.5569 | 15_9-F-1C  | 4049.6 |
| 39038 | 0.2997 | 2.4186 | 109.0336 | 2.6197 | 6.2539 | 8.5569 | 15_9-F-1C  | 4049.7 |
| 39039 | 0.2930 | 2.4232 | 106.0935 | 2.5948 | 6.2883 | 8.5570 | 15_9-F-1C  | 4049.8 |
| 39040 | 0.2892 | 2.4285 | 105.4931 | 2.6344 | 6.3400 | 8.6056 | 15_9-F-1C  | 4049.9 |
| 39041 | 0.2956 | 2.4309 | 109.8965 | 2.6459 | 6.3998 | 8.5569 | 15_9-F-1C  | 4050.0 |

Exploratory Data Analysis

Pair-plot of the Train Data

train_features = ['NPHI', 'RHOB', 'GR', 'RT', 'PEF', 'CALI', 'DT']

sns.pairplot(train, vars=train_features, diag_kind='kde',
             plot_kws = {'alpha': 0.6, 's': 30, 'edgecolor': 'k'})


pairplot

Spearman’s Correlation Heatmap

train_only_features = train[train_features]

# Generate a mask for the upper triangle
mask = np.zeros_like(train_only_features.corr(method='spearman'), dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Custom colormap
cmap = sns.cubehelix_palette(n_colors=12, start=-2.25, rot=-1.3, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
plt.figure(figsize=(12,10))
sns.heatmap(train_only_features.corr(method = 'spearman'), annot=True,  mask=mask, cmap=cmap, vmax=.3, square=True)

plt.show()


heatmap

Transformation of the Train Data

Normalize/transform the dataset:

  • Log transform the RT log
  • Use power transform with Yeo-Johnson method (except ‘WELL’ and ‘DEPTH’)
# Log transform the RT to logarithmic
train['RT'] = np.log10(train['RT'])

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer

# Columns to transform (features + target); WELL and DEPTH stay untouched
feature_target = ['NPHI', 'RHOB', 'GR', 'RT', 'PEF', 'CALI', 'DT']

# Transformation / Normalizer object Yeo-Johnson method
scaler = PowerTransformer(method='yeo-johnson')

# ColumnTransformer (feature_target defines which columns it is applied to; WELL and DEPTH pass through)
ct = ColumnTransformer([('transform', scaler, feature_target)], remainder='passthrough')

# Fit and transform
train_trans = ct.fit_transform(train)

# Convert to dataframe (transformed columns first, passthrough columns last)
colnames = feature_target + ['WELL', 'DEPTH']
train_trans = pd.DataFrame(train_trans, columns=colnames)
train_trans
|       | NPHI     | RHOB      | GR       | RT        | PEF       | CALI      | DT       | WELL       | DEPTH  |
| ----- | -------- | --------- | -------- | --------- | --------- | --------- | -------- | ---------- | ------ |
| 0     | 1.702168 | -0.920748 | 1.130650 | -0.631876 | 0.031083  | 0.450019  | 1.588380 | 15_9-F-11A | 2600.0 |
| 1     | 1.573404 | -1.020621 | 1.092435 | -0.736154 | -0.373325 | -1.070848 | 1.562349 | 15_9-F-11A | 2600.1 |
| 2     | 1.407108 | -1.142493 | 1.030314 | -0.758080 | -0.819890 | -1.070848 | 1.525055 | 15_9-F-11A | 2600.2 |
| 3     | 1.260691 | -1.263078 | 0.956135 | -0.784153 | -1.197992 | -1.753641 | 1.460934 | 15_9-F-11A | 2600.3 |
| 4     | 1.189869 | -1.367969 | 0.837247 | -0.816586 | -1.441155 | -2.286221 | 1.367432 | 15_9-F-11A | 2600.4 |
| 24398 | 0.462363 | -0.279351 | 0.839177 | -0.704005 | -0.910619 | 2.041708  | 0.047941 | 15_9-F-1B  | 3399.6 |
| 24399 | 0.439808 | -0.261621 | 0.860577 | -0.694407 | -0.826995 | 1.510434  | 0.043466 | 15_9-F-1B  | 3399.7 |
| 24400 | 0.393869 | -0.232335 | 0.849885 | -0.653120 | -0.774093 | 1.510434  | 0.042591 | 15_9-F-1B  | 3399.8 |
| 24401 | 0.373838 | -0.176628 | 0.822640 | -0.569367 | -0.782672 | 1.510434  | 0.044596 | 15_9-F-1B  | 3399.9 |
| 24402 | 0.351335 | -0.106609 | 0.806807 | -0.535769 | -0.819021 | 1.514682  | 0.052292 | 15_9-F-1B  | 3400.0 |

Pair-Plot (after transformation)

sns.pairplot(train_trans, vars=feature_target, diag_kind = 'kde',
             plot_kws = {'alpha': 0.6, 's': 30, 'edgecolor': 'k'})


pairplot transformed

Removing Outliers

Outliers can be removed easily with an API call. The issue is that there are multiple methods that may work with varying efficiency. Let’s run a few and select which one performs best.

from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Make a copy of train
train_fonly = train_trans.copy()

# Remove WELL, DEPTH
train_fonly = train_fonly.drop(['WELL', 'DEPTH'], axis=1)
train_fonly_names = train_fonly.columns

# Helper function for repeated plotting

def makeboxplot(my_title='enter title',my_data=None):
    _, ax1 = plt.subplots()
    ax1.set_title(my_title, size=15)
    ax1.boxplot(my_data)
    ax1.set_xticklabels(train_fonly_names)
    plt.show()
	
makeboxplot('Unprocessed',train_trans[train_fonly_names])
print('n samples unprocessed:', len(train_fonly))


m0 unprocessed

n samples unprocessed: 24,403

Method 1: Standard Deviation

train_stdev = train_fonly[np.abs(train_fonly - train_fonly.mean()) <= (3 * train_fonly.std())]

# Delete NaN
train_stdev = train_stdev.dropna()

makeboxplot('Method 1: Standard Deviation',train_stdev)
print('Remaining samples:', len(train_stdev))


m1 StDev

Remaining samples: 24,101

Method 2: Isolation Forest

iso = IsolationForest(contamination=0.5)
yhat = iso.fit_predict(train_fonly)
mask = yhat != -1
train_iso = train_fonly[mask]

makeboxplot('Method 2: Isolation Forest',train_iso)
print('Remaining Samples:', len(train_iso))


m2 Iso Forest

Remaining Samples: 12,202

Method 3: Minimum Covariance Determinant

ee = EllipticEnvelope(contamination=0.1)
yhat = ee.fit_predict(train_fonly)
mask = yhat != -1
train_ee = train_fonly[mask]

makeboxplot('Method 3: Minimum Covariance Determinant',train_ee)
print('Remaining samples:', len(train_ee))


m3 MinCovDet

Remaining samples: 21,962

Method 4: Local Outlier Factor

lof = LocalOutlierFactor(contamination=0.3)
yhat = lof.fit_predict(train_fonly)
mask = yhat != -1
train_lof = train_fonly[mask]

makeboxplot('Method 4: Local Outlier Factor',train_lof)
print('Remaining samples:', len(train_lof))


m4 Local Outlier Factor

Remaining samples: 17,082

Method 5: Support Vector Machine

svm = OneClassSVM(nu=0.1)
yhat = svm.fit_predict(train_fonly)
mask = yhat != -1
train_svm = train_fonly[mask]

makeboxplot('Method 5: Support Vector Machine',train_svm)
print('Remaining samples:', len(train_svm))


m5 Support Vector Machine

Remaining samples: 21,964

One-class SVM performs best

Make pair-plot of data after outliers removed.

sns.pairplot(train_svm, vars=feature_target,
             diag_kind='kde',
             plot_kws = {'alpha': 0.6, 's': 30, 'edgecolor': 'k'})


pairplot after outlier removal

Train and Validate

Define the train data as the SVM outlier-removed-data.

# Names of the feature columns and of the target column
feature_names = ['NPHI', 'RHOB', 'GR', 'RT', 'PEF', 'CALI']
target_name = 'DT'

# Select columns for features (X) and target (y)
X_train = train_svm[feature_names].values
y_train = train_svm[target_name].values

Define the validation data 
train_trans_copy = train_trans.copy()

train_well_names = ['15_9-F-11A', '15_9-F-1A', '15_9-F-1B']

X_val = []
y_val = []

for i in range(len(train_well_names)):
  # Split the df by log name
  val = train_trans_copy.loc[train_trans_copy['WELL'] == train_well_names[i]]

  # Drop name column 
  val = val.drop(['WELL'], axis=1)

  # Define X_val (feature) and y_val (target)
  X_val_, y_val_ = val[feature_names].values, val[target_name].values
  
  X_val.append(X_val_)
  y_val.append(y_val_)

# save into separate dfs
X_val1, X_val3, X_val4 = X_val
y_val1, y_val3, y_val4 = y_val

Fit to Test and Score on Val

from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor

# Gradient Booster object
model = GradientBoostingRegressor()

# Fit the regressor to the training data
model.fit(X_train, y_train)

# Validation: Predict on well 1
y_pred1 = model.predict(X_val1)
print("R2 Log 1: {}".format(round(model.score(X_val1, y_val1),4)))
rmse = np.sqrt(mean_squared_error(y_val1, y_pred1))
print("RMSE Log 1: {}".format(round(rmse,4)))

# Validation: Predict on well 3
y_pred3 = model.predict(X_val3)
print("R2 Log 3: {}".format(round(model.score(X_val3, y_val3),4)))
rmse = np.sqrt(mean_squared_error(y_val3, y_pred3))
print("RMSE Log 3: {}".format(round(rmse,4)))

# Validation: Predict on well 4
y_pred4 = model.predict(X_val4)
print("R2 Log 4: {}".format(round(model.score(X_val4, y_val4),4)))
rmse = np.sqrt(mean_squared_error(y_val4, y_pred4))
print("RMSE Log 4: {}".format(round(rmse,4)))

R2 Log 1: 0.9526

RMSE Log 1: 0.2338

R2 Log 3: 0.9428

RMSE Log 3: 0.2211

R2 Log 4: 0.8958

RMSE Log 4: 0.2459

These R2 scores are quite good for relatively little effort!

Inverse Transformation of Prediction

# Make the transformer fit to the target
y = train[target_name].values
scaler.fit(y.reshape(-1,1))

# Inverse transform  y_val, y_pred
y_val1, y_pred1 = scaler.inverse_transform(y_val1.reshape(-1,1)), scaler.inverse_transform(y_pred1.reshape(-1,1))
y_val3, y_pred3 = scaler.inverse_transform(y_val3.reshape(-1,1)), scaler.inverse_transform(y_pred3.reshape(-1,1))
y_val4, y_pred4 = scaler.inverse_transform(y_val4.reshape(-1,1)), scaler.inverse_transform(y_pred4.reshape(-1,1))

Plot a comparison between train and prediction of the DT feature.

x = [y_val1, y_pred1, y_val3, y_pred3, y_val4, y_pred4]
y = [log1['DEPTH'], log1['DEPTH'], log3['DEPTH'], log3['DEPTH'], log4['DEPTH'], log4['DEPTH']]

color = ['red', 'blue', 'red', 'blue', 'red', 'blue']
title = ['DT Log 1', 'Pred. DT Log 1', 'DT Log 3', 'Pred DT Log 3',
         'DT Log 4', 'Pred DT Log 4']

fig, ax = plt.subplots(nrows=1, ncols=6, figsize=(15,10))

for i in range(len(x)):
  ax[i].plot(x[i], y[i], color=color[i])
  ax[i].set_xlim(50, 150)
  ax[i].set_ylim(np.max(y[i]), np.min(y[i]))
  ax[i].set_title(title[i])

plt.tight_layout()

plt.show()


DT train vs prediction

Hyperparameter Tuning

The example below is GridSearchCV hyperparameter tuning on Scikit-Learn's GradientBoostingRegressor, playing through all 32 combinations of the parameter ranges below (each cross-validated 3 times).

Other search strategies exist that narrow the parameter space down in a smarter, automated way (e.g., randomized search); a minimal sketch follows after the grid-search results below.

from sklearn.model_selection import train_test_split

# Define the X and y from the SVM normalized dataset
X = train_svm[feature_names].values
y = train_svm[target_name].values

# Train and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

The grid search will churn for several minutes; the best-scoring parameter combination is saved and printed.

from sklearn.model_selection import GridSearchCV

model = GradientBoostingRegressor()

# Hyperparameter ranges

max_features = ['auto', 'sqrt']
min_samples_leaf = [1, 4]
min_samples_split = [2, 10]
max_depth = [10, 100]
n_estimators = [100, 1000]

param_grid = {'max_features': max_features,
              'min_samples_leaf': min_samples_leaf,
              'min_samples_split': min_samples_split,
              'max_depth': max_depth,
              'n_estimators': n_estimators}

# Train with grid
model_random = GridSearchCV(model, param_grid, cv=3)
model_random.fit(X_train, y_train)

# Print best model
model_random.best_params_

{'max_depth': 100,
 'max_features': 'sqrt',
 'min_samples_leaf': 4,
 'min_samples_split': 10,
 'n_estimators': 1000}
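As an aside, here is a minimal sketch of a randomized alternative to the exhaustive grid using scikit-learn's RandomizedSearchCV (my addition, not part of the original analysis; the ranges and n_iter are illustrative, and X_train/y_train from the split above are assumed to be reused):

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Illustrative ranges only
param_distributions = {
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'max_depth': [10, 50, 100],
    'n_estimators': [100, 500, 1000],
}

# Sample 10 random combinations instead of trying all of them
model_rnd = RandomizedSearchCV(GradientBoostingRegressor(), param_distributions,
                               n_iter=10, cv=3, random_state=42)
model_rnd.fit(X_train, y_train)
model_rnd.best_params_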

Predict Test Wells

Define the Test Data

# Define the test data 
names_test = ['15_9-F-11B', '15_9-F-1C']

X_test = []
y_test = []
depths = []

for i in range(len(names_test)):
  # split the df with respect to its name
  test = pred.loc[pred['WELL'] == names_test[i]]

  # Drop well name column 
  test = test.drop(['WELL'], axis=1)

  # Define X_test (feature)
  X_test_ = test[feature_names].values

  # Define depth
  depth_ = test['DEPTH'].values
  
  X_test.append(X_test_)
  depths.append(depth_)

# For each well 2 and 5
X_test2, X_test5 = X_test
depth2, depth5 = depths

Transform Test - Predict - Inverse Transform

# Transform X_test of log 2 and 5
X_test2 = scaler.fit_transform(X_test2)
X_test5 = scaler.fit_transform(X_test5)

# Predict to log 2 and 5 with tuned model
y_pred2 = model_random.predict(X_test2)
y_pred5 = model_random.predict(X_test5)

y = train[target_name].values
scaler.fit(y.reshape(-1,1))

# Inverse transform y_pred
y_pred2 = scaler.inverse_transform(y_pred2.reshape(-1,1))
y_pred5 = scaler.inverse_transform(y_pred5.reshape(-1,1))

Plot the Predictions

plt.figure(figsize=(5,12))

plt.subplot(1,2,1)
plt.plot(y_pred2, depth2, color='green')
plt.ylim(max(depth2), min(depth2))
plt.title('Pred DT Log 2: 15_9-F-11B', size=12)

plt.subplot(1,2,2)
plt.plot(y_pred5, depth5, color='green')
plt.ylim(max(depth5), min(depth5))
plt.title('Pred DT Log 5: 15_9-F-1C', size=12)

plt.tight_layout()
plt.show()


DT test predictions

def makeplotpred(df,suptitle_str="pass a suptitle"):
  # Column selection from df
  col_names = ['NPHI', 'RHOB', 'GR', 'RT', 'PEF', 'CALI', 'DT']
  # Plotting titles
  title = ['NPHI', 'RHOB', 'GR', 'RT', 'PEF', 'CALI', 'Predicted DT']
  # plotting colors
  colors = ['purple', 'purple', 'purple', 'purple', 'purple', 'purple', 'green']

  # Create the subplots; ncols equals the number of logs
  fig, ax = plt.subplots(nrows=1, ncols=len(col_names), figsize=(15,10))
  fig.suptitle(suptitle_str, size=20, y=1.05)

  # Looping each log to display in the subplots
  for i in range(len(col_names)):
    if i == 3:
      # for resistivity, semilog plot
      ax[i].semilogx(df[col_names[i]], df['DEPTH'], color=colors[i])
    else:
      # for non-resistivity, normal plot
      ax[i].plot(df[col_names[i]], df['DEPTH'], color=colors[i])
  
    ax[i].set_ylim(max(df['DEPTH']), min(df['DEPTH']))
    ax[i].set_title(title[i], pad=15)
    ax[i].grid(True)

  ax[2].set_xlim(0, 200)
  plt.tight_layout(pad=1)
  plt.show()

# Assumed step: write the predicted DT back into the clipped test logs so it can be plotted
log2['DT'] = y_pred2.ravel()
log5['DT'] = y_pred5.ravel()

makeplotpred(log2,"Log 2: 15_9-F-11B")


DT test predictions

makeplotpred(log5,"Log 5: 15_9-F-1C")


DT test predictions

ADV SQL QRIES

ADV SQL QRIES

Remind me, how did that expression go?


  1. CASE
    1. CASE and aggregate function (SUM)
    2. Percentages with AVG and CASE
  2. TRY queries and CATCH errors
    1. Nested TRY/CATCH
    2. CATCH Error Details
    3. Custom Errors, RAISERROR
  3. Subqueries
    1. Subqueries in WHERE clause
    2. Subqueries in FROM
    3. Subqueries in SELECT
    4. Correlated Subqueries
    5. Nested Subqueries in FROM
  4. Common Table Expressions (CTE)
  5. Window Functions - They all use OVER()
    1. RANK()
    2. DENSE_RANK()
    3. Giving rows a number with ROW_NUMBER()
    4. Framing for sliding windows using RANGE BETWEEN…AND or ROWS BETWEEN…AND
    5. Fetching relative rows: LAG() and LEAD()
    6. Fetching absolute rows: FIRST_VALUE() and LAST_VALUE()
    7. PARTITION BY subdivides processing ranges in a column
    8. Paging with NTILE()- subdivide a table into equal-length pages
  6. Pivoting with the CROSSTAB() Extension
  7. Group total rows with GROUP BY ROLLUP() and CUBE()
  8. Fill Null values with COALESCE()
  9. String Manipulation
    1. Compress a Column into a single String with STRING_AGG()
    2. Concatenating Multiple Columns with || and CONCAT()
    3. Changing Case with UPPER(), LOWER(), INITCAP()
    4. Replace Characters with REPLACE()
    5. Trim Characters with TRIM(), LTRIM(), RTRIM()
    6. Pad Characters with PAD(), LPAD(), RPAD()
    7. Reverse char order with REVERSE()
    8. Character Count with CHAR_LENGTH() or LENGTH()
    9. Find Char Position with STRPOS() or POSITION()
    10. Clip Strings from the Ends with LEFT(), RIGHT(), and SUBSTRING()
  10. Data Types
    1. Casting Data Types
    2. Creating Data Types (PostgreSQL)
    3. Getting Information about Data Types (PostgreSQL)
  11. User-Defined Functions (PostgreSQL)
  12. Text Search and NLP (PostgreSQL)
    1. Full-Text Search (PostgreSQL)
    2. Compare Two Strings (PostgreSQL)
  13. Drop Duplicate Rows (based on Specific Columns)
  14. Time Manipulation
    1. Intervals and Time Differences
    2. Extracting from Timestamps
    3. Truncating Timestamps
  15. Transactions (MS SQL SERVER)
    1. With TRY/CATCH
    2. With XACT_ABORT
  16. Special Case: Accessing ARRAYS
  17. SQL query in Python Pandas
    1. ETL process
  18. Table Management
    1. Create Table
    2. Insert Values into Columns
    3. System Database Information Gathering
  19. Extension Management (PostgreSQL)
    1. Show Extensions
    2. Install/Import Extensions

The following post consists of notes I had on my drive and kept returning to over and over to remind myself of some concepts and SQL syntax.

Of course, the original tables and outputs are not available here, but the point is providing a lookup for structures and building blocks.

CASE

The CASE statement can create a new column based on multiple conditions in a specified column.

The structure always follows CASE WHEN ... THEN ... ELSE ... END AS ...; the ELSE and the AS alias are optional.

SELECT  
	m.date,
	t.team_long_name AS opponent,
	CASE WHEN m.home_goal < m.away_goal THEN 'Barcelona win!'
		 WHEN m.home_goal > m.away_goal THEN 'Barcelona loss :(' 
		 ELSE 'Tie' END AS outcome
FROM matches_spain AS m
LEFT JOIN teams_spain AS t 
ON m.hometeam_id = t.team_api_id
WHERE m.awayteam_id = 8634;

CASE and aggregate function (SUM)

The following query creates multiple new columns with a summation aggregate function.

SELECT 
	c.name AS country,
	SUM(CASE WHEN m.season = '2012/2013' AND m.home_goal > m.away_goal 
			 THEN 1 ELSE 0 END) AS matches_2012_2013,
				
	SUM(CASE WHEN m.season = '2013/2014' AND m.home_goal > m.away_goal 
			 THEN 1 ELSE 0 END) AS matches_2013_2014,
				
	SUM(CASE WHEN m.season = '2014/2015' AND m.home_goal > m.away_goal
			 THEN 1 ELSE 0 END) AS matches_2014_2015
FROM country AS c
LEFT JOIN match AS m
ON c.id = m.country_id
GROUP BY country;

Percentages with AVG and CASE

For each country, calculate the percentage of ties (home goals = away goals). Use the AVG and CASE (for 1/0 assignment) functions.

SELECT 
	c.name AS country,
	ROUND(AVG(CASE WHEN m.season='2013/2014' AND m.home_goal = m.away_goal THEN 1
					WHEN m.season='2013/2014' AND m.home_goal != m.away_goal THEN 0
					END),2) AS pct_ties_2013_2014,
				   
	ROUND(AVG(CASE WHEN m.season='2014/2015' AND m.home_goal = m.away_goal THEN 1
					WHEN m.season='2014/2015' AND m.home_goal != m.away_goal THEN 0
					END),2) AS pct_ties_2014_2015
FROM country AS c
LEFT JOIN matches AS m
ON c.id = m.country_id
GROUP BY country;

TRY queries and CATCH errors

As in most languages, SQL can make use of TRY/CATCH (sometimes called exception/error blocks).

BEGIN TRY
	ALTER TABLE products
		ADD CONSTRAINT CHK_Stock CHECK (stock >= 0);
END TRY

BEGIN CATCH
	SELECT 'There was an error adding the constraint!';
END CATCH

Nested TRY/CATCH

It is also possible to have a nested TRY block inside the CATCH block, which lets you attempt a second query if the first failed.

Use example: the nested (inner) TRY block could be used to write an error message into a table, recording that the outer TRY failed.

BEGIN TRY
	INSERT INTO bikes (make, model, year, size)
		VALUES ('Nukeproof', 'Grizzlybear', '2021', 'S');
END TRY

BEGIN CATCH
	SELECT 'An error occurred inserting the bike!';
    BEGIN TRY
    	INSERT INTO errors 
        	VALUES ('Error inserting a bike');
        SELECT 'Error inserted correctly!';
	END TRY
	
    BEGIN CATCH
    	SELECT 'An error occurred inserting the error!';
    END CATCH 
END CATCH

CATCH Error Details

The following query writes a message to an ‘error’ table if the outer Try block fails. If that fails, too, then the error details of the inner try are queried.

BEGIN TRY 	
	INSERT INTO bikes (make, model, year, size)
		VALUES ('Nukeproof', 'Grizzlybear', '2021', 'S');  
END TRY

BEGIN CATCH
	SELECT 'An error occurred inserting the product!';
    BEGIN TRY
    	INSERT INTO errors 
        	VALUES ('Error inserting a product');
    END TRY
	
    BEGIN CATCH
    	SELECT  ERROR_NUMBER() AS number,  
        	ERROR_SEVERITY() AS severity_level,  
        	ERROR_STATE() AS state,
        	ERROR_LINE() AS line,  
        	ERROR_MESSAGE() AS message; 
    END CATCH    
END CATCH

Custom Errors, RAISERROR

RAISERROR lets you define a custom error and its parameters. Syntax:

RAISERROR ( { msg_str | msg_id | @local_variable_message },  
    severity, 
    state,
    [ argument [ ,...n ] ] )  
    [ WITH option [ ,...n ] ] 

DECLARE @product_id INT = 56747;

BEGIN TRY
IF NOT EXISTS (SELECT * FROM products WHERE product_id = @product_id)
	RAISERROR('No product with id %d.', 11, 1, @product_id);
ELSE 
	SELECT * FROM products WHERE product_id = @product_id;
END TRY

BEGIN CATCH
    SELECT  ERROR_NUMBER() AS number,  
        ERROR_SEVERITY() AS severity_level,  
        ERROR_STATE() AS state,
        ERROR_LINE() AS line,  
        ERROR_MESSAGE() AS message; 
END CATCH

Subqueries

While regular subqueries can be evaluated on their own, correlated subqueries cannot, because they need results from the outer query. Queries with multiple subqueries get complicated quickly.

Subqueries in WHERE clause

SELECT
	team_long_name,
	team_short_name
FROM team
WHERE team_api_id IN
	(SELECT hometeam_id 
	 FROM match
	 WHERE home_goal >= 8);

Subqueries in FROM

SELECT
	country,
	date,
	home_goal,
	away_goal
FROM 
	(SELECT c.name AS country, 
			m.date, 
			m.home_goal, 
			m.away_goal,
			(m.home_goal + m.away_goal) AS total_goals
	FROM match AS m
	LEFT JOIN country AS c
	ON m.country_id = c.id) AS subq
WHERE total_goals >= 10;

Subqueries in SELECT

Subqueries in SELECT statements generate a single value!

SELECT 
	l.name AS league,
   	ROUND(AVG(m.home_goal + m.away_goal), 2) AS avg_goals,
	(SELECT ROUND(AVG(home_goal + away_goal), 2) 
     	FROM match
     	WHERE season = '2013/2014') AS overall_avg
FROM league AS l
LEFT JOIN match AS m
ON l.country_id = m.country_id
WHERE season = '2013/2014'
GROUP BY league;

Correlated Subqueries

Correlated subqueries reference one or more columns in the main query. They depend on information in the main query to run and thus cannot be executed on their own. Correlated subqueries are evaluated once per row of retrieved data, a process that takes a lot more computing power and time than a simple subquery.

The following WHERE clause filters for entries with total goals higher than three times the average.

SELECT 
	main.country_id,
	date,
	main.home_goal, 
	main.away_goal
FROM match AS main
WHERE 
	(home_goal + away_goal) > 
		(SELECT AVG((sub.home_goal + sub.away_goal) * 3)
		 FROM match AS sub
		 WHERE main.country_id = sub.country_id);

Nested Subqueries in FROM

Now it's getting complicated. It is best to design such queries from the inside out.

SELECT
	c.name AS country,
	AVG(outer_s.matches) AS avg_seasonal_high_scores
FROM country AS c
LEFT JOIN (
		SELECT country_id, season,
			COUNT(id) AS matches
		FROM (
			SELECT country_id, season, id
			FROM match
			WHERE home_goal >= 5 OR away_goal >= 5) AS inner_s
		GROUP BY country_id, season) AS outer_s
ON c.id = outer_s.country_id
GROUP BY country;

Common Table Expressions (CTE)

CTEs are declared and executed ahead of the main query and help organizing queries with complicated subqueries.

They always precede the main query and go WITH cte AS (SELECT…..)

WITH home AS (
	SELECT m.id,
		m.date,
		t.team_long_name AS hometeam,
		m.home_goal
	FROM match AS m
	LEFT JOIN team AS t 
	ON m.hometeam_id = t.team_api_id),

away AS (
	SELECT m.id,
		m.date, 
		t.team_long_name AS awayteam,
		m.away_goal
  	FROM match AS m
  	LEFT JOIN team AS t 
  	ON m.awayteam_id = t.team_api_id)

SELECT 
	home.date,
	home.hometeam,
	away.awayteam,
	home.home_goal,
	away.away_goal
FROM home
INNER JOIN away
ON home.id = away.id;

Window Functions - They all use OVER()

The OVER() clause allows you to pass an aggregate function down a data set, similar to a subquery in SELECT, but it runs faster.

An empty OVER() generates a column with an aggregate over the complete data set, repeating the same value in each row.

SELECT 
	m.id, 
	c.name AS country, 
   	m.season,
	m.home_goal,
	m.away_goal,
	AVG(m.home_goal + m.away_goal) OVER() AS overall_avg
FROM match AS m
LEFT JOIN country AS c ON m.country_id = c.id;

The following query with OVER(ORDER BY ...) calculates the running total/sum up to the current row.

SELECT
  athlete,
  medals,
  SUM(medals) OVER(ORDER BY athlete ASC) AS sum_medals
FROM athlete_medals
ORDER BY athlete ASC;

RANK()

Calculate a ranking and also order the query by rank.

Note: Assigns the same number to rows with identical values and skips over the next numbers. E.g.: 1,2,3,3,5,6,6,8…

The following query calculates the average goals per game and groups by league. The averages are ranked by league.

SELECT 
	l.name AS league,
	AVG(m.home_goal + m.away_goal) AS avg_goals,
	RANK() OVER(ORDER BY AVG(m.home_goal + m.away_goal) DESC) AS league_rank
FROM league AS l
LEFT JOIN match AS m 
ON l.id = m.country_id
WHERE m.season = '2011/2012'
GROUP BY l.name
ORDER BY league_rank;
SELECT
  athlete,
  medals,
  RANK() OVER (ORDER BY medals DESC) AS Rank_N
FROM Athlete_Medals
ORDER BY Medals DESC;

DENSE_RANK()

Note: Assigns same number to rows with identical values but does NOT skip over the next numbers. E.g.: 1,2,3,3,4,5,5,6…

SELECT
  country,
  athlete,
  DENSE_RANK() OVER (PARTITION BY country ORDER BY medals DESC) AS rank_n
FROM athlete_medals
ORDER BY country ASC, rank_n ASC;

Giving rows a number with ROW_NUMBER()

Generate a column that counts the current ROW_NUMBER().

SELECT
  *,
  ROW_NUMBER() OVER() AS Row_N
FROM Summer_Medals
ORDER BY Row_N ASC;

The ORDER BY inside the OVER() clause takes precedence over the outer ORDER BY. The following query numbers only the distinct years, which it pulls from a subquery.

SELECT
  Year,
  ROW_NUMBER() OVER(ORDER BY Year ASC) AS Row_N
FROM (
  SELECT DISTINCT Year
  FROM Summer_Medals
) AS Yea

Framing for sliding windows using RANGE BETWEEN…AND or ROWS BETWEEN…AND

Keywords for sliding windows are UNBOUNDED, n PRECEDING, n FOLLOWING, and CURRENT ROW

The window runs sums and averages between the current row and either the top or the bottom of the table. Both directions arrive at the same final cumulative sum, but the way to get there is different.

Difference RANGE vs. ROWS: RANGE treats duplicates in OVER(ORDER BY ...) as a single entity, which is a special case. ROWS treats every row individually.

| Medals | Rows Running Total | Range Running Total |
| ------ | ------------------ | ------------------- |
| 10     | 10                 | 10                  |
| 50     | 60                 | 110                 |
| 50     | 110                | 110                 |
| 60     | 170                | 230                 |
| 60     | 230                | 230                 |
| 70     | 300                | 300                 |

The following query calculates the running total (RT) and running average (RA). The window uses ascending date ordering (oldest date at the top) -> the window goes from the top of the table (oldest date) to the current row.

SELECT 
	date,
	home_goal,
	away_goal,
	SUM(home_goal) OVER(ORDER BY date 
		ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
	AVG(home_goal) OVER(ORDER BY date 
		ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_avg
FROM match
WHERE 
	hometeam_id = 9908 
	AND season = '2011/2012';

The following query calculates the RT and RA with descending date ordering (newest date at the top) -> the window goes from the current row to the bottom of the table (oldest date).

SELECT 
	date,
	home_goal,
	away_goal,
	SUM(home_goal) OVER(ORDER BY date DESC
		ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS running_total,
	AVG(home_goal) OVER(ORDER BY date DESC
		ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS running_avg
FROM match
WHERE 
	awayteam_id = 9908 
    AND season = '2011/2012';

Moving average (MA). Calculate a column with the 3-year (2 preceding + current) average.

SELECT
	year,
	medals,
	AVG(medals) OVER(ORDER BY year ASC
		ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS medals_ma
FROM russian_medals
ORDER BY year ASC;

Fetching relative rows: LAG() and LEAD()

The LAG(column, n) function fetches a single value from n rows BEFORE (above) the current row to create a new column.

The following query determines (1) which country won gold at each year's Olympics and (2) which country is the reigning/previous champion.

SELECT
	Year, Champion,
	LAG(Champion) OVER(ORDER BY year ASC) AS Last_Champion
FROM weightlifting_gold
ORDER BY year ASC;

The LEAD(column, n) function fetches a single value from n rows AFTER (below) the current row to create a new column.

SELECT
	year,
	athlete,
	LEAD(athlete, 3) OVER (ORDER BY year ASC) AS future_champion
FROM discus_medalists
ORDER BY year ASC;

Fetching absolute rows: FIRST_VALUE() and LAST_VALUE()

These functions do not depend on the current row’s position. FIRST_VALUE(column) and LAST_VALUE(column)

The following query fetches the gold-winning athletes and a second column with the first athlete in alphabetical order.

SELECT
	athlete,
	FIRST_VALUE(athlete) OVER(ORDER BY athlete ASC) AS First_Athlete
FROM all_male_medalists;

Fetching the last value is more complicated, since it requires the definition of the RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

Otherwise the default frame ends at the current row, so the last value of the window would just be the current row, and that's not what we mean.

SELECT
	year,
	city,
	LAST_VALUE(city) OVER(ORDER BY year ASC
		RANGE BETWEEN
		UNBOUNDED PRECEDING AND
		UNBOUNDED FOLLOWING
		) AS last_city
FROM hosts
ORDER BY Year ASC;

PARTITION BY subdivides processing ranges in a column

PARTITION BY splits the rows into separate processing sections (partitions). This prevents window functions such as ROW_NUMBER() or LAG() from leaking values from one section into another.

The following query partitions the table by gender so that the LAG() values for the 'Men' gold winners do not spill into 'Women'. At the first row of each partition, a null value is returned instead.

SELECT
	gender,
	year,
	country AS champion,
	LAG(country,1) OVER(PARTITION BY gender ORDER BY year ASC) AS last_champion
FROM athletics_gold
ORDER BY gender ASC, year ASC;

Partition by multiple columns (gender and event).

SELECT
	gender,
	year,
	event,
	country AS champion,
	LAG(country,1) OVER (PARTITION BY gender, event ORDER BY year ASC) AS last_champion
FROM athletics_gold
ORDER BY event ASC, gender ASC, year ASC;

Paging with NTILE()- subdivide a table into equal-length pages

NTILE(num_pages) OVER(ORDER BY column) AS new_column divides the table into num_pages equal-sized pages based on the values in the given column. The new column contains the page number each row was assigned to. These pages are effectively quantiles, on which statistical analysis can be performed.

The following query subdivides the ordered medals into three pages.

SELECT
	athlete,
	medals,
	NTILE(3) OVER(ORDER BY medals DESC) AS third
FROM athlete_medals
ORDER BY medals DESC, athlete ASC;

Perform aggregate function analysis on the pages/quantiles.

The following query gets the average medals earned in each page.

WITH thirds AS (
	SELECT
		athlete,
		medals,
		NTILE(3) OVER(ORDER BY medals DESC) AS third
	FROM athlete_medals)
  
SELECT
	third,
	AVG(medals) AS avg_medals
FROM thirds
GROUP BY third
ORDER BY third ASC;

Pivoting with the CROSSTAB() Extension

Pivoting with CROSSTAB requires the tablefunc extension, which needs to be created before querying.

A regular query is wrapped in SELECT * FROM CROSSTAB($$ ... $$) AS ct (column COLTYPE, ...), and the new pivoted column names and types are defined in the ct(...) list.

CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT * FROM CROSSTAB($$
	SELECT
		gender,
		year,
		country
	FROM summer_medals
	WHERE
		year IN (2008, 2012)
		AND medal = 'gold'
		AND event = 'pole vault'
	ORDER BY gender ASC, year ASC;
$$) AS ct (gender VARCHAR,
           "2008" VARCHAR,
           "2012" VARCHAR)
ORDER BY gender ASC;

Original:

| gender | year | country |
| --- | --- | --- |
| Men | 2008 | AUS |
| Men | 2012 | FRA |
| Women | 2008 | RUS |
| Women | 2012 | USA |

Pivot by year:

| gender | 2008 | 2012 |
| --- | --- | --- |
| Men | AUS | FRA |
| Women | RUS | USA |

Group total rows with GROUP BY ROLLUP() and CUBE()

ROLLUP() vs CUBE()

Use ROLLUP() for hierarchical data such as date parts (year, quarter) when you don't want all possible group-level aggregations. Use CUBE() for all possible group-level aggregations.

Null entries in the result rows signify the total rows.

The following query counts medals by country and gender. At the gender group level there is a total row (male+female) using the ROLLUP(gender) clause.

SELECT
	country,
	gender,
	COUNT(*) AS gold_awards
FROM summer_medals
WHERE
	year = 2004
	AND medal = 'Gold'
	AND country IN ('DEN', 'NOR', 'SWE')
GROUP BY country, ROLLUP(gender)
ORDER BY country ASC, gender ASC;

Group totals by all possible groups with CUBE(). Note: all columns have to go into the CUBE() or it will fail.

SELECT
	gender,
	medal,
	COUNT(*) AS awards
FROM Summer_Medals
WHERE
	year = 2012
	AND country = 'RUS'
GROUP BY CUBE(gender, medal)
ORDER BY gender ASC, medal ASC;

Fill Null values with COALESCE()

Useful for any function that produces nulls, such as ROLLUP(), CUBE(), pivoting, first/last row of LAG(), LEAD().

Use in SELECT section of query: COALESCE(column, 'null filler string') AS column.

SELECT
	COALESCE(country, 'all countries') AS country,
	COALESCE(gender, 'all genders') AS gender,
	COUNT(*) AS awards
FROM summer_medals
WHERE
	year = 2004
	AND medal = 'gold'
	AND country IN ('DEN', 'NOR', 'SWE')
GROUP BY ROLLUP(country, gender)
ORDER BY country ASC, gender ASC;
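
COALESCE() also fills the null that LAG() produces in the first row of a window. A minimal sketch, reusing the weightlifting_gold table from the LAG() example above:

SELECT
	Year,
	Champion,
	COALESCE(LAG(Champion) OVER(ORDER BY Year ASC), 'no previous champion') AS Last_Champion
FROM weightlifting_gold
ORDER BY Year ASC;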

String Manipulation

Compress a Column into a single String with STRING_AGG()

Use in the SELECT statement as such: STRING_AGG(column, 'separator')

SELECT
	STRING_AGG(country, ', ')
FROM country_ranks
WHERE rank_col <= 3;

Input Table:

| Country | Rank |
| --- | --- |
| USA | 1 |
| RUS | 2 |
| AUS | 3 |

Output string: USA, RUS, AUS

Concatenating Multiple Columns with || and CONCAT()

Merge strings with ||

SELECT 
	first_name || ' ' || last_name AS full_name
FROM
	film;

CHARLOTTE HUNTER

Another example:

SELECT
	customer_id || ': ' || first_name || ' ' || last_name AS full_name
FROM customer;

1: CHARLOTTE HUNTER

Merge strings with PostgreSQL's CONCAT()

SELECT
	CONCAT(first_name,' ', last_name) AS full_name
FROM
	film;

CHARLOTTE HUNTER

Changing Case with UPPER(), LOWER(), INITCAP()

SELECT
	UPPER(email)
FROM customer;

HEDGE@HOG.COM

SELECT
	LOWER(email)
FROM customer;

hedge@hog.com

SELECT
	INITCAP(title)
FROM film;

Police Academy

Replace Characters with REPLACE()

SELECT
	REPLACE(email, 'hedgeAThog.com', 'hedge@hog.com') AS email
FROM customer;

Trim Characters with TRIM(), LTRIM(), RTRIM()

TRIM([leading | trailing | both][characters] from string)

If only the string is passed to TRIM(), leading and trailing space characters are removed by default.

If no trim character is specified, the space character is the default.

SELECT
	TRIM('    hedge@hog.com    '),
	LTRIM('    hedge@hog.com    '),
	RTRIM('    hedge@hog.com    ')

'hedge@hog.com', 'hedge@hog.com    ', '    hedge@hog.com'

Pad Characters with LPAD(), RPAD()

Syntax: LPAD(input_string, length_after_pad, pad_string). Note that if the input string is longer than the target length, it is truncated rather than padded! If no padding string is passed, the space character is the default.

SELECT
	LPAD('hedge@hog.com', 20, '#'),
	RPAD('hedge@hog.com', 17, '$')

'#######hedge@hog.com', 'hedge@hog.com$$$$'

Reverse char order with REVERSE()

SELECT
	email,
	REVERSE(email)
FROM customer;

hedge@hog.com, moc.goh@egdeh

Character Count with CHAR_LENGTH() or LENGTH()

SELECT
	title,
	CHAR_LENGTH(title),
	LENGTH(title)
FROM film;

Police Academy, 14, 14

Find Char Position with STRPOS() or POSITION()

Remember that SQL indexing starts at 1, not 0!

SELECT
	email,
	STRPOS(email, '@'),
	POSITION('@' IN email)
FROM customer;

hedge@hog.com, 6, 6

Clip Strings from the Ends with LEFT(), RIGHT(), and SUBSTRING()

LEFT(), RIGHT(): Specify how many characters to be clipped starting from the ends.

SELECT
	RIGHT(column_name, 50),
	LEFT(column_name, 10)
FROM customer;

SUBSTRING()/SUBSTR(): Specify the start index and length.

SELECT
	SUBSTRING(column_name, 10, 50),
	SUBSTR(column_name, 10, 50)
FROM customer;

SUBSTRING(): Use functions to determine start/end index with FROM…FOR expression.

SELECT
	SUBSTRING(email FROM 0 FOR POSITION('@' IN email)), -- name part of address up to @
	SUBSTRING(email FROM POSITION('@' IN email)+1 FOR CHAR_LENGTH(email)) -- provider part of address starting after @
FROM customer;

hedge, hog.com

Clip a column to 50 characters but do not clip a word in half.

SELECT
	LEFT(description, 50 - POSITION(' ' IN REVERSE(LEFT(description, 50)))) -- subtract the position of the first whitespace in the reversed string (i.e., the last space within the first 50 characters)
FROM film;

Data Types

Casting Data Types

PostgreSQL native casting with two colons column :: data-type

SELECT
	NOW():: timestamp;

Standard SQL casting with CAST(column AS data-type) function.

SELECT
	CAST(NOW() AS timestamp);

Creating Data Types (PostgreSQL)

CREATE
	TYPE dayofweek as ENUM(
		'Monday',
		'Tuesday',
		'Wednesday',
		'Thursday',
		'Friday',
		'Saturday',
		'Sunday'
		);
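
The new type can then be used like any built-in data type, for example as a column type. A small sketch (the work_schedule table and its columns are made up):

CREATE TABLE work_schedule (
	employee_id integer,
	work_day dayofweek
	);

INSERT INTO work_schedule
	(employee_id, work_day) VALUES (1, 'Monday');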

Getting Information about Data Types (PostgreSQL)

SELECT
	typname,
	typcategory
FROM pg_type
WHERE typname = 'dayofweek';

User-Defined Functions (PostgreSQL)

A User-Defined Function is the equivalent of a Stored Procedure.

A function can bundle several SQL queries together into a package using CREATE.

CREATE FUNCTION squared (i integer) RETURNS integer AS $$
	BEGIN
		RETURN i*i;
	END;
$$ language plpgsql;	--define language

SELECT
	squared(10);

100
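
A function can also bundle several statements, for example run an UPDATE and return how many rows it touched. A rough sketch (the status column on rental is made up for illustration):

CREATE FUNCTION flag_overdue_rentals() RETURNS integer AS $$
	DECLARE
		affected integer;
	BEGIN
		UPDATE rental
		SET status = 'overdue'			-- hypothetical column
		WHERE return_date IS NULL
			AND rental_date < NOW() - INTERVAL '7 days';
		GET DIAGNOSTICS affected = ROW_COUNT;	-- number of rows updated
		RETURN affected;
	END;
$$ language plpgsql;

SELECT flag_overdue_rentals();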

Text Search and NLP (PostgreSQL)

Full-Text Search (PostgreSQL)

PostgreSQL extends the normal WHERE column LIKE '%mystring%' pattern matching (wildcards %, _) with full-text search capabilities.

Additional features are stemming, fuzzy matching (spelling mistakes), and ranking.

SELECT
	title,
	description
FROM film
WHERE to_tsvector(title) @@ to_tsquery('elf');
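
Ranking, mentioned above, can be added with ts_rank(), which scores how well each row matches the query. A minimal sketch based on the query above:

SELECT
	title,
	ts_rank(to_tsvector(title), to_tsquery('elf')) AS rank
FROM film
WHERE to_tsvector(title) @@ to_tsquery('elf')
ORDER BY rank DESC;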

Compare Two Strings (PostgreSQL)

Calculate the Levenshtein Distance (number of edits required between two words).

SELECT
	levenshtein('hugo', 'hagolm');

3

Calculate Similarity (0-1) with similarity() from the pg_trgm extension.

SELECT
	similarity('hugo', 'hagolm');

Drop Duplicate Rows (based on Specific Columns)

Drop duplicate rows based on specific columns, i.e., when not the entire row is duplicated.

Use a Common Table Expression (CTE) with a window function ‘PARTITION BY’ that creates a row number per window.

Any duplicates will have a row number > 1. Then, query the CTE and keep only rows with row number = 1.

WITH cte AS (
    SELECT *, 
        ROW_NUMBER() OVER (
            PARTITION BY 
                col_1, 
                col_2, 
                col_3
			ORDER BY 
                col_1, 
                col_2, 
                col_3
        ) row_num
    FROM orders_table
)
SELECT * 
FROM cte
WHERE row_num = 1;

| id | salesworker_firstname | worker_lastname | registration_date | row_num |
| --- | --- | --- | --- | --- |
| 00000011913 | Johnny | Smith | 2014-01-31 | 1 |
| 00000012714 | Brian | Merkel | 2014-02-28 | 1 |
| 00000012029 | Karin | Harris | 2014-01-31 | 1 |
| 00000012816 | Carol | Schulze | 2014-02-28 | 1 |

Time Manipulation

  • NOW() and CURRENT_TIMESTAMP fetch the current timestamp (microsecond precision) with timezone. CURRENT_TIMESTAMP(2) can be passed a precision argument that rounds to the given number of fractional digits. If the timezone is not required, both can be cast to the timestamp data type.
  • CURRENT_DATE only fetches the date portion of the current timestamp.
  • CURRENT_TIME only fetches the time portion (with timezone) of the current timestamp and can also be passed a precision argument. See the sketch after this list.
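
A minimal sketch showing these functions side by side (the column aliases are made up):

SELECT
	NOW() AS now_with_tz,
	CURRENT_TIMESTAMP(2) AS now_rounded,
	CURRENT_DATE AS today,
	CURRENT_TIME AS time_with_tz,
	NOW()::timestamp AS now_without_tz;	-- cast drops the timezone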

Intervals and Time Differences

This query selects a timestamp-style column and performs arithmetic on it with an INTERVAL literal. The INTERVAL adds 3 days to the rental date so we can check whether the return date was on time.

SELECT
	rental_date,
	return_date,
	rental_date + INTERVAL '3 days' AS expected_return_date
FROM rental;

Convert integers to time intervals.

SELECT
	title,
	INTERVAL '1' day * rental_duration AS rental_duration,
	return_date - rental_date AS days_rented
FROM film
WHERE return_date IS NOT NULL

The AGE() function calculates an INTERVAL between a given timestamp/date and the current date if ONE parameter is given. If TWO parameters are given, it returns the interval between them.

SELECT
	AGE(rental_date)
FROM rental;
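
With two parameters, AGE() returns the interval between them, e.g. the actual rental duration:

SELECT
	AGE(return_date, rental_date) AS rental_duration
FROM rental;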

Extracting from Timestamps

Two useful functions are EXTRACT(field FROM source) and DATE_PART('field',source), which produce interchangeable results.

Timestamp parts/fields to extract:

| Field (long) | Field (short) | Description |
| --- | --- | --- |
| year | y | Year |
| quarter |  | Quarter |
| month |  | Month |
| day | d | Day of Month |
|  | dow | Day of Week |
| week | w | Week |
| hour | h | Hour |
| minute | m | Minute |
| second | s | Second |
| millisecond | ms | Millisecond |

Examples for the quarter of the year in a timestamp:

SELECT
	EXTRACT(quarter FROM TIMESTAMP '2009-02-12 05:18:00') AS quarter;

1

SELECT
	DATE_PART('quarter', TIMESTAMP '2009-02-12 05:18:00') AS quarter;

1

SELECT 
	EXTRACT(dow FROM rental_date) AS dayofweek, 
	COUNT(*) as rentals 
FROM rental 
GROUP BY 1; -- column called by positional index instead of names

The following query extracts quarters and years from a timestamp column and sums the payment amount by quarter and year.

SELECT
	EXTRACT(quarter FROM payment_date) AS quarter,
	EXTRACT(year FROM payment_date) AS year,
	SUM(amount) AS total_payments
FROM payment
GROUP BY 1, 2; 

The following query extracts the day of week and the rental duration (rental_days) for rentals within a 90-day interval starting 2005-05-01.

SELECT 
	EXTRACT(dow FROM r.rental_date) AS dayofweek,
	AGE(return_date, rental_date) AS rental_days
FROM rental AS r 
WHERE rental_date
	BETWEEN CAST('2005-05-01' AS TIMESTAMP)
	AND CAST('2005-05-01' AS TIMESTAMP) + INTERVAL '90 day';

Truncating Timestamps

The DATE_TRUNC() function truncates a timestamp or interval data type from fine detail (e.g., minutes) to low detail (e.g., year). The timestamp's length remains the same, but the truncated fields are set to their lowest logical value (01 for months/days, 00 for hours/minutes/seconds).

SELECT
	DATE_TRUNC('year', TIMESTAMP '2009-02-12 05:18:00');

2009-01-01 00:00:00

SELECT
	DATE_TRUNC('month', TIMESTAMP '2009-02-12 05:18:00');

2009-02-01 00:00:00

The following query truncates to a day and counts the number of entries (rentals) by day.

SELECT 
	DATE_TRUNC('day', rental_date) AS rental_day,
	COUNT(*) AS rentals 
FROM rental
GROUP BY 1;

Transactions (MS SQL SERVER)

Transactions move values between owners, for example money between bank accounts. This means a transaction contains a subtraction and an addition that both have to succeed.

The concept of “atomic transactions” rolls back a transaction if any of its parts raises an error. SQL Server has a specific ‘TRAN’ keyword that can be put in a ‘TRY’ block. If the ‘TRAN’ raises an error, the ‘CATCH’ block calls ‘ROLLBACK TRAN’.

With TRY/CATCH

BEGIN TRY  
	BEGIN TRAN;
		UPDATE accounts SET current_balance = current_balance - 100 WHERE account_id = 4451;
		INSERT INTO transactions VALUES (4451, -100, GETDATE());
        
		UPDATE accounts SET current_balance = current_balance + 100 WHERE account_id = 6743;
		INSERT INTO transactions VALUES (6743, 100, GETDATE());
	COMMIT TRAN;
END TRY

BEGIN CATCH
	SELECT 'Rolling back the transaction';
	ROLLBACK TRAN;
END CATCH

With XACT_ABORT

With ‘SET XACT_ABORT ON;’ errors in a ‘TRAN’ block will always be rolled back. No ‘TRY’ block needed.

Note: ‘THROW’ error still works with XACT_ABORT.

SET XACT_ABORT ON;

BEGIN TRAN; 
	UPDATE accounts set current_balance = current_balance - current_balance * 0.01 / 100
		WHERE current_balance > 5000000;
	IF @@ROWCOUNT <= 10	
		THROW 55000, 'Not enough wealthy customers!', 1;
	ELSE		
		COMMIT TRAN;

Special Case: Accessing ARRAYS

It is not necessary to fetch the entire array of a column. We can select specific elements with square brackets. Indexing in PostgreSQL starts at 1, not 0 as in many other languages.

Search for text in ARRAY in specific position.

SELECT
	email[1][1] AS type,			-- select from first address the first element (address type/description)
	email[1][2] AS address,			-- select from first address the second element (email address)
	test_scores[1]				    -- select only first element of the array
FROM funky_table
WHERE email[1][1] = 'work';			-- only search in array element 1 and sub-element 1

Search for text in any ARRAY position with ANY or @> operators. Results are equivalent.

SELECT
	email[2][1] AS type,			-- select from second address the first element (address type/description)
	email[2][2] AS address,			-- select from second address the second element (email address)
	test_scores[1]				    -- select only first element of the array
FROM funky_table
WHERE 'home' = ANY (email);			-- select where the column email contains an array element 'home' in any array position

SELECT 
  email[2][1] AS type, 
  email[2][2] AS address,
  test_scores[1]
FROM funky_table
WHERE email @> ARRAY['home'];

SQL query in Python Pandas

The SQL query is passed as a single string. The triple quotation mark wrapping (''' or three double quotes) is useful for multi-line strings in Python. The table name goes in quotation marks, and the SQL server connection has to be passed after the query.

Connection URI (Uniform Resource Identifier) for PostgreSQL: postgresql://[user[:password]@][host][:port][/database]

import sqlalchemy

connection_uri = "postgresql://repl:password@localhost:5432/pagila"
db_engine = sqlalchemy.create_engine(connection_uri)

import pandas as pd

df = pd.read_sql("""
SELECT
    first_name,
    last_name
FROM "Customer"
ORDER BY last_name, first_name
""", db_engine)

df.head(3)
|   | first_name | last_name |
| --- | --- | --- |
| 0 | Connagh | Bailey |
| 1 | Brook | Bloom |
| 2 | Ann | Dalton |

ETL process

  1. For the “Extract” step, the following function queries an entire SQL table from the source (OLTP) database and loads it into a Pandas dataframe.
import sqlalchemy
import pandas as pd

# custom extraction function
def extract_table_to_pandas(tablename, db_engine):
    query = "SELECT * FROM {}".format(tablename)
    return pd.read_sql(query, db_engine)

connection_uri = "postgresql://repl:password@localhost:5432/pagila" 
db_engine = sqlalchemy.create_engine(connection_uri)

extract_table_to_pandas("film", db_engine)
  2. Do some column transformations in Spark, e.g., convert float to string and split on a pattern into multiple columns.

  3. Load the result into the analytical PostgreSQL database (data warehouse).

connection_uri = "postgresql://repl:password@localhost:5432/dwh"
db_engine_dwh = sqlalchemy.create_engine(connection_uri)

film_pdf_joined.to_sql("film", db_engine_dwh, schema="store", if_exists="replace")

# check the table in the SQL db
pd.read_sql("SELECT film_id, recommended_film_ids FROM store.film", db_engine_dwh)

Table Management

Create Table

Creating a table schema works with [column name] [data-type]. ARRAYS can be created with square brackets. Multiple square brackets are nested arrays.

CREATE TABLE funky_table (
	column_1 text,
	column_2 integer,
	test_scores integer[],
	employee_id integer,
	email text[][]
	);

Insert Values into Columns

Inserting works by naming the columns if only specific columns are inserted into.

INSERT INTO funky_table
	(column_1, column_2) VALUES ('string for column_1', 12);

Inserting into all columns at once can be done without column names, but the order of the values has to match the columns. ARRAY values are written as quoted literals with curly braces {}, with array elements separated by commas.

INSERT INTO funky_table
	VALUES ('Prickly Hedgehog',
		2,
		'{99,83,18,811}',	-- standard array with four elements
		2224,
		'{{"work", "[email protected]"},{"home", "[email protected]"}}'	-- nested text[][] array
		);

System Database Information Gathering

Which tables are in the system database?

SELECT * 
FROM
	information_schema.tables
WHERE
	table_schema = 'public';

Which columns are in the ‘actor’ table and what are their details?

SELECT * 
FROM
	information_schema.columns
WHERE
	table_name = 'actor';

Column descriptions are detailed: table_catalog table_schema table_name column_name ordinal_position column_default is_nullable data_type character_maximum_length ….. and so on.

Extension Management (PostgreSQL)

Show Extensions

Available in the distribution ready to be installed.

SELECT
	name
FROM pg_available_extensions;

dblink, pg_stat_statements, pg_trgm

Currently installed and ready for use.

SELECT
	extname
FROM pg_extension;

plpgsql

Install/Import Extensions

First install and then check if the installation succeeded.

CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
SELECT
	extname
FROM pg_extension;

The New Business of AI (and How It’s Different From Traditional Software) - a16z

The New Business of AI (and How It’s Different From Traditional Software) - a16z

Summary from reading the Andreessen Horowitz blog


  1. Preface
  2. Introduction
  3. Software + Services = AI?
    1. Software Company
    2. Service Company
    3. AI Company
  4. Gross Margins, Part 1: Cloud costs can be massive for AI
  5. Gross Margins, Part 2: Many AI applications rely on “humans in the loop”
  6. Scaling AI systems can be rockier - AI lives in the long tail
  7. How to defend AI businesses has to be figured out
  8. Building, scaling, and defending great AI companies – practical advice for founders
  9. Summary

The writeup below contains some highlights I wrote for myself. If you find the writeup useful and want to know more, then you should absolutely read the linked article.

Preface

This is my personal summary from reading The New Business of AI (and How It’s Different From Traditional Software), which was published on the Andreessen Horowitz blog a16z.com, a Silicon Valley venture capital (VC) investor. This VC supported the likes of Skype and Facebook when they were still little known, and dozens of others whose digital services we use in our daily lives. The article is excellently written and part of the AI in 2020 series. This series is a distillation of dozens of interviews the authors led with ML teams and thus marks a great resource for machine learning professionals who want deep insights into new AI developments, be it at startups or mature companies. And last but not least, it helps with evaluating the next job opportunities. Most of all, the series is intended by Andreessen Horowitz to support AI companies, a sector in which they are investing heavily.

Introduction

Investors are betting today that AI companies will make a similar return as software companies.

However, in many cases AI companies have a different economic construction than software businesses:

  • Lower gross margins -> heavy cloud infrastructure + human support

  • Scaling problems due to edge cases.

  • Weaker defensive moats -> commoditization of AI models and challenges with data network effects.

Consistent pattern in the financial data of AI companies: gross margins in the 50-60% range, versus 60-80%+ for SaaS businesses. Early investors pushing AI companies for rapid growth over returns may obscure these lower margins. AI is going to be a new type of business - after SaaS - and it is clearly different from software. The differences have to be acknowledged and addressed.

Software + Services = AI?

Software Company

Software (including SaaS) can be produced once and sold many times. Benefits are: A) recurring revenue streams, B) high (60-80%+) gross margins, and C) super linear scaling with (rare) network effects, D) ownership of intellectual property (source code) builds defensive moats.

Service Company

Each new project requires dedicated headcount and can be sold exactly once: A) revenue is non-recurring, B) gross margins are lower (30-50%), C) scaling is linear at best, D) defensibility is challenging – often based on brand or incumbent account control – because any IP not owned by the customer is unlikely to have broad applicability.

AI Company

Appear to combine elements of both software and services. Most AI applications look + feel like normal software: interfacing with users, managing data, or integrating with other systems. However, at the center is a set of trained data models. Maintaining these models is more like a services business. That is: requiring significant, customer-specific work and input costs beyond typical support and success functions.

Gross Margins, Part 1: Cloud costs can be massive for AI

  • AI is very demanding on cloud infrastructure: training, inference, and retraining due to data-drift, large file size input data and model transfer across cloud regions are more demanding than traditional software operations.

  • Combined, 25% (or more) of revenue is spent on cloud resources. Manual processing can be cheaper in a company’s early start-up phase.

  • Model complexity is growing at a fast rate. 10x data is needed for 2x model precision. CPUs/GPUs cannot be accelerated and made cheaper fast enough. Distributed computing solves speed issues, but not costs.

Gross Margins, Part 2: Many AI applications rely on “humans in the loop”

  • Training most of today’s state-of-the-art AI models involves the manual cleaning and labeling of large datasets. This process is laborious, and expensive.

  • For many tasks, especially those requiring greater cognitive reasoning, humans are often plugged into AI systems in real time.

  • The need for human intervention will likely decline as the performance of AI models improves. It’s unlikely that humans will be removed entirely.

  • How much will margins increase with AI fully automated? Reducing human intervention (and their costs) will likely increase cloud costs.

Scaling AI systems can be rockier - AI lives in the long tail

  • New customers increase the work for the ML team when the models have to be adjusted for them -> drawing resources away from new sales. The problem is edge cases. Many AI apps have open-ended interfaces and operate on noisy, unstructured data (e.g., images, NLP).

  • Handling edge cases means endless possible input values. Each new customer deployment is likely to infer on completely new data.

  • AI companies spend time re-training the model for each new customer to eliminate edge cases prior to deployment until accuracy reaches acceptable levels.

  • Startups spend more time + resources for deployment than expected. Identifying challenges in advance can be difficult since traditional prototyping tools (mockups, prototypes, beta tests) do not cover edge cases.

How to defend AI businesses has to be figured out

  • Some of the best defensive moats are network effects, high switching costs, and economies of scale. Additionally, branding comes from first to market and near-exclusivity.

  • For AI players, differentiation is harder: New model architectures are developed mostly in open, academic settings. Reference implementations (pre-trained models) are available from open-source libraries, and hyperparameters can be optimized automatically.

  • While data is at the heart of AI, it is often owned by customers, in the public domain, or over time becomes a commodity.

  • Moats for AI companies appear to be shallower than for software and shallower than expected.

Building, scaling, and defending great AI companies – practical advice for founders

  • Reduce model complexity: If possible, share a single model (or set of models) among all customers. It is easier to maintain, faster to roll out, and supports a simpler, more efficient engineering organization. Result: minimize cloud costs.

  • Choose problem domains carefully and narrowly: The minimum viable task for AI models is narrower than expected. Choose problems that are hard for humans but easy for AI, that is, high-scale, low-complexity tasks.

  • Plan for high variable costs: build a business model and Go-To-Market strategy with lower gross margins in mind. Understand the distribution of the model data. Model maintenance and human failover are first-order problems. Measure real variable cost (no hiding in R&D). Make conservative assumptions in financial models (esp. fundraise). Scale and (outside) tech advances may not solve the problem.

  • Embrace services: Building hybrid businesses (software/AI plus service) is harder than pure software, but this approach can provide deep insight into customer needs and yield fast-growing, market-defining companies.

  • Plan for change in the tech stack: AI is still new. Standardized tools are just now being built: tools to automate model training, make inference more efficient, standardize developer workflows. Tightly coupling an application to the current state may lead to an architectural disadvantage in the future.

  • Build defensible: It is unclear whether an AI model or the underlying data is a long-term moat. Good products and proprietary data almost always build good businesses. AI may vitalize old markets. Effective cloud strategies are key.

Summary

Most AI systems are not like software, and as such the business works differently. AI businesses do not scale as easily, and it is harder to build defensibility. Taken together, that makes AI feel more like a services business. These are unfamiliar patterns, which indicates that AI companies are truly something new, including massive opportunities.

Taming the Tail - Adventures in Improving AI Economics - a16z

Taming the Tail - Adventures in Improving AI Economics - a16z

Summary from reading the Andreessen Horowitz blog


  1. Preface
  2. Introduction
  3. Part I: Understanding the problem to be solved
    1. Building vs. experimenting (or, software vs. AI)
    2. The long tail and machine learning
    3. Impact on the economics of AI
  4. Part II: Building Better AI Systems
    1. Toward a Solution
    2. Easy mode: Bounded problems
    3. Harder: Global long tail problems
    4. Really hard: Local long tail problems
    5. Table stakes: Operations
  5. Summary

The writeup below contains some highlights I wrote for myself. If you find the writeup useful and want to know more, then you should absolutely read the linked article.

Preface

This is my personal summary from reading Taming the Tail: Adventures in Improving AI Economics, which was published on the Andreessen Horowitz blog a16z.com, a Silicon Valley venture capital (VC) investor. This VC supported the likes of Skype and Facebook when they were still little known, and dozens of others whose digital services we use in our daily lives. The article is excellently written and part of the AI in 2020 series. This series is a distillation of dozens of interviews the authors led with ML teams and thus marks a great resource for machine learning professionals who want deep insights into new AI developments, be it at startups or mature companies. And last but not least, it helps with evaluating the next job opportunities. Most of all, the series is intended by Andreessen Horowitz to support AI companies, a sector in which they are investing heavily.

Introduction

AI companies are different from traditional software companies. They often have lower margins, are harder to scale, and don’t always have strong defensive moats.

Part I: Understanding the problem to be solved

(…in the presence of long-tailed data distributions)

Building vs. experimenting (or, software vs. AI)

Traditional software is a task of engineering. The specifications of a product are defined and then slowly added to the software - one line of code at a time.

AI developers build statistical models, measure them, build a new one, measure, and so on. Until a satisfactory quality threshold is crossed, the AI modeler has to go through this experimentation loop. The article quotes an AI CTO saying that the process is “closer to molecule discovery in pharma” than to engineering. Another issue is that the code written by the AI developer only indirectly influences the output, which adds an additional layer of complexity: it mainly governs how the dataset is processed, while the quality of the outcome remains uncertain.

The long tail and machine learning

Many natural ML datasets have a long-tailed (skewed) frequency distribution. The long tail makes up ~70% of the total number of samples, so a randomly picked sample is much more likely to come from the tail. The head and middle, where the most common samples are, receive only 30% of the picks.

However, the low-frequency samples from the tail are harder to train on with supervised learning, while their value is relatively low. These long-tail samples are often also called “edge cases” and are much harder to get right.

Impact on the economics of AI

  • Improving models on the long tail costs a lot of compute resources and thus creates more (cloud) costs than traditional software would cause (3-5x higher).

  • Data has a cost to collect, process, and maintain. That means the larger the long-tailed dataset becomes, the harder and more costly it becomes to work with. Such projects actually show “DISeconomies of scale”. While more data to improve a model is often available, the long-tailedness also means diminishing returns: AI developers may need 10x more data to achieve a 2x subjective improvement. And while computing power becomes cheaper and more available (Moore’s Law), this does not currently happen with model performance.

Part II: Building Better AI Systems

Toward a Solution

The economics of AI are a function of the problem to be solved (modeling/characterizing the dataset drives computing costs), not of the technology used (computing efficiency gains). Hence, the approach to the long-tailed frequency distribution has to be optimized.

Easy mode: Bounded problems

First of all, it has to be determined if a long-tailed frequency dataset is there at all. If not, then no ML or DL should be used. Linear or Polynomial Regression will be much more efficient.

Logistic Regression and Random Forests are popular because they are A) interpretable, B) scalable, C) computationally efficient. DL may have its use cases in NLP (text generation and understanding), sentiment analysis but accuracy improvement has to be tightly balanced with training and maintenance costs.

ML is not always the best solution to everything, which should be very closely evaluated. Heuristic and if/then approaches may be much easier to implement.

Harder: Global long tail problems

Serving a single (global) model if there is large customer-segment/region overlap is much cheaper than building multiple models. -> “One size fits all”

Tactics to build around the long tail:

  • Optimize the model -> add more (customer) training data, adjust hyperparameters, tweak architecture.
  • Narrow the problem to the fat head -> restrict user-input complexity to avoid creating an unnecessary tail -> e.g. auto-complete text input for the user.
  • Convert the problem -> human intervention for special edge cases or create single-turn interfaces that cover the long tail more easily.

Componentizing:

Train different, simpler models to do different jobs concurrently. Example: Cloudflare using 7 models to detect 6-7 different bot types attacking websites. Each model is global and relatively easy to maintain.

Really hard: Local long tail problems

It is common to observe local problem (data) variation with customers, which do have to be addressed. Effort in modeling is based on the degree of local variation (countries, specific large customers….). Example: streaming services generating playlist models for each country.

How to bring the benefits of global models to local problems?

  • Adoption of the meta-model technique, which covers multiple clients and tasks: e.g., combine thousands of user-specific models into a single, cheaper meta model.
  • Transfer Learning: Using pre-trained models simplifies training needs and makes fine-tuning per customer easier. Example: GPT-3 NLP model that needs to be adjusted for local needs.
  • Join similar models with a common trunk to reduce complexity, making the trunk as thick as possible and the branches as thin as possible without decreasing accuracy. Example: Facebook combining models for furniture, cars, and fashion, which is cheaper and more accurate. Benefits: parallel model training and high local accuracy, yielding a global-pattern-like model.

Table stakes: Operations

Operational best practices to improve AI economics:

  • Consolidate data pipelines. Do not use one pipeline per model. Combine multiple customers into one ETL process, re-train less often (only at night, or when enough data available).
  • Edge case engine: systematically collect odd long-tail samples into a dedicated collection.
  • On-prem ML clusters with GPUs can unlock massive savings. Example: a startup saved 10 million USD with its own cluster in a colocation facility compared to AWS.
  • Model quantization, distillation, pruning, compression, and compilation (including APIs and pre-trained models)
  • Frequent testing, not only accuracy, F1 etc. but also on A) bad data, B) privacy violations, C) precision mismatches, D) model drift, E) bias, and F) adversarial tactics.

Summary

AI is still in its early years and has not yet crossed the peak of the hype cycle. More efficiency is needed. Structures similar to software development will likely appear.
