Party with R: How the Community Enabled Us to Write a Book

[This article was first published on R Views, and kindly contributed to R-bloggers]. Isabella C. Velásquez, MS, is a data analyst committed to nonprofit work … Read more

87th TokyoR Meetup Roundup: {data.table}, Bioconductor, & more!

As the monsoon season (finally) ends, another TokyoR meetup! Since COVID hit, all of TokyoR’s meetups since February have been done online, and the transition has been seamless thanks to the efforts of the TokyoR organizing team. It was my first TokyoR since January, so it was great to be back! In line with my previous round up posts: … Read more

What’s the Difference Between a Data Analyst, Data Scientist, and a Machine Learning Engineer?

This article uses the metaphor of a track team to differentiate between the role of a data analyst, data scientist, and machine learning engineer. We’ll start with the idea that conducting a data science project is similar to running a relay race. Hopefully, this analogy will help you make more informed choices around your education, … Read more

Effect of COVID-19 on Our Mobility

I just moved to NYC in January 2020. My reason for moving was to enjoy the socially rich community and the plethora of activities that I could explore, such as museums, cute little theaters, Broadway shows, fancy stores, riding the subway and watching people, restaurants and the fun nightlife. And I did for two months. … Read more

Hierarchical Clustering: Agglomerative and Divisive — Explained

An overview of agglomerative and divisive clustering algorithms and their implementation Photo by Lukas Blazek on Unsplash Hierarchical clustering is a method of cluster analysis that is used to cluster similar data points together. Hierarchical clustering follows either the top-down or bottom-up method of clustering. Clustering is an unsupervised machine learning technique that divides the … Read more

Visualizing the Nothing

I’ll run the code in JupyterLab, and I’ll use Pandas, NumPy, Missingno, and Matplotlib for this example. The dataset will be the California Jail Profile Survey, which contains monthly county-level data from 1995 to 2018. import pandas as pd; f = 'data/california_jail_county_monthly_1995_2018.csv'; df = pd.read_csv(f) After loading the dataset to Pandas, we can have a look at … Read more

When to use PowerShell instead of shell scripts?

Here you can find the full documentation about PowerShell installation. OK, no more talk, let’s code. Back to our scenario, where we want to replace a piece of information in an existing file. Let’s say that we have a JSON file with multiple levels like this: Assuming that we want to replace the “fileName” for … Read more

Comprehensive Guide to the Data Warehouse

Data science can’t start until the data cleaning process is complete. Learn about the role of the data warehouse as a repository of analysis-ready datasets. Hunting for clean data in the enterprise setting. Photo by Hu Chen on Unsplash. As a data scientist, it’s valuable to have some idea of fundamental data warehouse concepts. Most … Read more

A Beginner’s Guide To Data Science

· Data Scientist: A data scientist handles large quantities of data to produce compelling visions for the particular business. They make use of various algorithms, tools, methods, and processes. A Data Scientist deals with programming languages like R, Python, SAS, SQL, Matlab, Spark, Hive and Pig. · Data Analyst: They mine vast quantities of data. … Read more

Retinal Images are Weirdly Predictive

Keep your eyes open for these developments in deep learning Photo by nrd on Unsplash If you’ve ever been to see an ophthalmologist, you’ve probably undergone a routine procedure where a specialist takes a picture of the back of your eye. You will not be surprised to hear that retinal images are rather handy for … Read more

Automatically Locking & Unlocking Ubuntu with Computer Vision Using a Human Face!!!

Generate face As a first step, we have to generate training images to train the models. I have created a Python file to generate our face image. When you execute this file, it will ask for your name on the command line, then the system will create a folder with the given name. number=0; frame_count=0; detector = dlib.get_frontal_face_detector(); print(“enter … Read more

Customer Experience, Artificial Intelligence and Machine Learning

Anthony, Paul and Ricky all agreed that a huge challenge for businesses is not having a solid data infrastructure, or a deep understanding of what exactly should be measured to achieve business goals and customer satisfaction. “Many companies approach us seeking to use AI as a ready-made silver bullet for a business problem. Others come … Read more

Introduction To Recommender Systems- 2: Deep Neural Network Based Recommendation Systems

The inner product of the above matrices gives back the interaction matrix. Now, the paper concentrated on implicit feedback: for example, if a user has interacted with an item, it’s considered positive feedback here. But in actual cases, 1 here may not directly imply the user liked the content. Similarly, 0 means there was … Read more

How a Text’s Network Structure Can Reveal Its Mind Viral Immunity

Mind Viral Somatic Immunity Workshop, EightOS + InfraNodus Any text can be represented as a network. The words are the nodes; the words’ co-occurrences are the connections between them. The resulting network structure can reveal how immune or susceptible the discourse is to external influence. The more diverse a text network is (in terms of … Read more
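The text-as-network idea in this excerpt can be sketched in a few lines of Python. This is only an illustration, not the InfraNodus implementation: the function name `cooccurrence_network`, the regular-expression tokenizer, and the `window` parameter are assumptions made for the example.

```python
import re
from collections import Counter


def cooccurrence_network(text, window=2):
    """Build a word network: words are nodes, co-occurrences are weighted edges.

    `window` is the size of the sliding window in words (an assumed parameter);
    window=2 links each word only to the word that immediately follows it.
    """
    words = re.findall(r"[a-z']+", text.lower())
    edges = Counter()
    for i, word in enumerate(words):
        # Pair the current word with the following words inside the window.
        for other in words[i + 1 : i + window]:
            if other != word:
                # Sort the pair so that (a, b) and (b, a) count as one edge.
                edges[tuple(sorted((word, other)))] += 1
    return set(words), edges


nodes, edges = cooccurrence_network("the cat sat on the mat")
print(sorted(nodes))
print(edges.most_common())
```

The edge counts can then be passed to any graph library to measure how diverse or centralized the resulting structure is.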

Why XGBoost Is So Effective?

When you have tons and tons of data, it will be difficult to fit it all into a computer at one time. In this condition, things like sorting and finding quantiles will become very SLOW! To solve this problem, an algorithm called SKETCH can provide an approximate but fast solution. Generally speaking, a sketch of big … Read more

9 Cool Julia Tricks In 4 Minutes

The best features of the Julia REPL Photo by Filip Baotić on Unsplash Julia is a new scientific computing language that is as easy to learn as Python, but executes as fast as C. Julia is a compiled language, but since it uses a just-in-time compiler (like Java), your code can be executed … Read more

A Scenic Look at the Julia Language

Experience Julia without having to do anything hard Photo by Nellia Kurme on Unsplash Julia is a new multi-purpose programming language designed to solve the “two language problem” by providing ease of use and speed. The other two most popular new languages developed in the 2010s, Go and Rust, both look like cleaned up C … Read more

How To Deal With Imbalanced Classification, Without Re-balancing the Data

Before considering over-sampling your skewed data, try simply tuning your classification decision threshold Photo by Elena Mozhvilo on Unsplash In machine learning, when building a classification model with data having far more instances of one class than another, the initial default classifier is often unsatisfactory because it classifies almost every case as the majority class. … Read more
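A minimal sketch of threshold tuning, assuming the classifier already outputs a score per case. The choice of F1 as the metric to maximize, the helper names, and the toy data are assumptions for illustration; the article may use a different criterion.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 score when every case with score >= threshold is predicted positive."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, scores, candidates):
    """Return the candidate threshold with the highest F1 score."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, scores, t))


# Toy imbalanced data: 8 negatives, 2 positives (made up for this example).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.30, 0.45, 0.35, 0.40]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5]

print(f1_at_threshold(y_true, scores, 0.5))        # default cut-off misses every positive
print(best_threshold(y_true, scores, candidates))  # a lower cut-off does better
```

In practice the threshold should be chosen on a validation set, not on the data used to report final performance.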

Here’s how I Learned Just Enough Programming for Data Science

My guess is that you’ve chosen the Python route, and that’s great for several reasons: The language is simple to learn — more beginner-friendly than Java/Go It’s the most widely used language for data science It’s a general-purpose language — not limited to statistical tasks As an aspiring data scientist, Python will suit you just … Read more

3 Python Concepts in under 3 minutes

Using the context of video game development for intuition Photo by Pixabay from Pexels In this article, I touch on 3 key concepts for developing code in Python. I will be using the context of game development as an example for each concept because I believe it makes for the most intuitive understanding. Let’s get … Read more

Weathering the Storm

[This article was first published on R | Quantum Jitter, and kindly contributed to R-bloggers]. Covid-19 began battering the financial markets in February. Which sectors … Read more

Is the race over for Seq2Seq models?

The idea of Seq2Seq, or sequence-to-sequence, models came from a paper by Ilya Sutskever et al., “Sequence to Sequence Learning with Neural Networks”. They are essentially a certain organization of deep sequential models (a.k.a. RNN-based models) (e.g. LSTMs/GRUs)[1] (discussed later). The main type of problem addressed by these models is mapping an arbitrary length sequence to … Read more

August Edition: Journalism In A World of Data

Unravelling the complexities of modern life, one bit at a time. Photo by Matthew Guay on Unsplash The realisation that data is a double-edged sword can be a hard pill to swallow. From unveiling the behaviour of our communities to creating entirely hypothetical realities, our community here at Towards Data Science has the luxury of … Read more

Plotting w/ Pandas and PPP Loan Data

For this example, we will be using the SBA Paycheck Protection Program Loan Level Data. In March 2020, U.S. lawmakers agreed to a stimulus bill worth $2 trillion. The package included: $1,200 payments to adults and $500 per child for households making up to $75,000; a $500 billion fund for loans to corporate America with every … Read more

Create and customize boxplots with Python’s Matplotlib to get lots of insights from your data

Boxplot highlighting outliers. Visualized in a boxplot outliers typically show up as circles. But as you’ll see in the next section, you can customize how outliers are represented 😀 If your dataset has outliers, it will be easy to spot them with a boxplot. There are different methods to determine that a data point is … Read more

Pearson and Spearman Rank Correlation Coefficient — Explained

Photo by M. B. M. on Unsplash Relationship between random variables. Correlation Coefficient is a statistical measure to find the relationship between two random variables. Correlation between two random variables can be used to compare the relationship between the two. By observing the correlation coefficient, the strength of the relationship can be measured. The value … Read more
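To make the difference between the two coefficients concrete, here is a small pure-Python sketch. The function names and the toy data are assumptions for this example; Spearman's coefficient is computed as the Pearson correlation of the ranks, with tied values given their average rank.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # y = x**3: monotonic but not linear

print(pearson(x, y))   # below 1, because the relationship is not linear
print(spearman(x, y))  # 1 up to floating-point rounding, because it is monotonic
```

The example shows why Spearman is the better choice when the relationship is monotonic but curved, while Pearson measures only the linear part.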

Monte Carlo integration in Python

A famous Casino-inspired trick for data science, statistics, and all of science. How to do it in Python? Image source: Wikipedia(Free) and collage made by the author Disclaimer: The inspiration for this article stemmed from Georgia Tech’s Online Masters in Analytics (OMSA) program study material. I am proud to pursue this excellent Online MS program. … Read more
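As a rough illustration of the technique named in the title, the sketch below estimates a definite integral by averaging the integrand at uniformly random points and scaling by the interval width. The function name `mc_integrate`, the sample count, and the fixed seed are assumptions for this example, not taken from the article.

```python
import random


def mc_integrate(f, a, b, n=100_000, seed=0):
    """Estimate the integral of f over [a, b] with n uniform random samples."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total = sum(f(rng.uniform(a, b)) for _ in range(n))
    # The mean of f over the samples, times the interval width, approximates the integral.
    return (b - a) * total / n


# The exact value of the integral of x^2 over [0, 1] is 1/3.
estimate = mc_integrate(lambda x: x * x, 0.0, 1.0)
print(estimate)
```

The error of the estimate shrinks roughly in proportion to one over the square root of the number of samples, so quadrupling `n` halves the typical error.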

Don’t Forget what ‘Deep’ & ‘Learning’ Actually Mean

Think critically about whether you need to apply deep-learning to your datasets. Before you apply deep-learning to your customer data as an AI startup, think critically on the basic statistics of your customer data first–note down biases in the set–and ask yourself if deep-learning is really necessary to use on that set, or if you … Read more

Algorithmic Trading 101

In this article, we will explore multiple technical indicators for time-series stock price data and a theoretically optimal strategy that has been tuned to beat the benchmark strategy. The indicators developed in this project will be used to design intuition-based and machine learning-based trading strategies in the forthcoming projects. Indicators use price and volume … Read more

Announcing PyCaret 2.0

https://www.pycaret.org We are excited to announce the second release of PyCaret today. PyCaret is an open source, low-code machine learning library in Python that automates machine learning workflow. It is an end-to-end machine learning and model management tool that speeds up machine learning experiment cycle and makes you more productive. In comparison with the other … Read more

Visualisation options to show growth in occupations in the Australian health industry by @ellis2013nz

Visualising growth in occupations in one industry A chart is doing the rounds purporting to show the number of administrators working in health care in the USA has grown much faster than the number of physicians – more than 3,000% growth from 1970 to 2009 for administrators (allegedly) compared to about 90% or so for … Read more

Using Dropout with Neural Networks: Not A Magic Bullet

Dropout is a regularization technique that is designed to prevent overfitting in a neural network. However, it shouldn’t be applied arbitrarily. Source: Image Created By Author Overfitting is an issue that occurs when a model shows high accuracy in predicting training data (the data used to build the model), but low accuracy in predicting test … Read more

ROC Curve and AUC — Explained

What they mean and when they are useful Photo by Markus Spiske on Unsplash ROC (receiver operating characteristics) curve and AUC (area under the curve) are performance measures that provide a comprehensive evaluation of classification models. ROC curve summarizes the performance by combining confusion matrices at all threshold values. AUC turns the ROC curve into … Read more
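One way to see what the area under the ROC curve measures is its pairwise interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes it that way; the function name `roc_auc` and the toy data are assumptions for this example, and the quadratic loop is meant for illustration, not for large datasets.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for label, s in zip(y_true, scores) if label == 1]
    negatives = [s for label, s in zip(y_true, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 3 of the 4 positive/negative pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.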

How to manage credentials and secrets safely in R

If you have ever received an embarrassing message with a warning saying that you may have published your credentials or secrets when publishing your code, you know what I’m talking about. A very common mistake among noob coders is (temporarily) hardcoding passwords, tokens, secrets, that should never be shared with others, and… shared … Read more

How We Made Profits Forecasting Wind Energy Production Levels

We learnt from the experiments using Model V1 that our engineered features consistently gave lower test losses and higher profits as compared to the benchmark. In this final round of experimentation, we thus stuck with all our engineered features and experimented with more sophisticated network features like input scaling, dropouts and L2 regularisation. This time, … Read more

Handling Categorical Data, The Right Way

Source: Jelleke Vanooteghem The most underrated way of encoding data and what you are doing wrong Categorical data is simply information aggregated into groups rather than being in numeric formats, such as Gender, Sex or Education Level. They are present in almost all real-life datasets, yet the current algorithms still struggle to deal with them. … Read more

Everything Has Its Price — How to Price Words for Ad Bidding, etc

This article sketches an NLP approach to pricing natural language words or phrases. It leverages creatively (1) the model word2vec, which learns the context and associations between words from a given corpus; (2) the Mondovo dataset, which provides basic building blocks for us to further bootstrap our application. This solution will have interesting applications in … Read more

Strategies for Learning Data Science

So the next question that you may have is whether you will need to learn how to code in order to become a data scientist. This really depends on the circumstances, so yes and no. Before I learned how to code, I would use this software called WEKA. It is a point-and-click graphical user interface … Read more

Estimating Year-of-Age Specific Risks for Covid-19

Age-based risk factors for Covid-19 have been reported in 10–20 year age bands. We can do better than that. It’s easy to estimate age-specific risks to the individual year, which allows people to assess their personal risk levels more accurately. We now have reasonably good data on risks associated with Covid-19, but it isn’t being … Read more

The Roots of Data Science

Thirty-five years after Tukey’s publication, Jeff Wu said this: Statistics = Data Science? He proposed that statistics should be renamed “data science” and statisticians should be named “data scientists”. By today’s standards, we know that statistics alone is not data science, but why? Because we also need programming, business understanding, machine learning, … Read more