In the spirit of fellow Chinese job seekers helping each other and passing along some positive energy, I'm sharing the machine learning interview questions I collected during my own job search, along with a few takeaways. My background: a fresh CS PhD in computer vision and machine learning, not from a top school.
Others have already compiled many machine learning interview questions (see: http://www.mitbbs.com/article/JobHunting/32808273_0.html); this post supplements that list, with a small amount of overlap. The questions fall into two groups: machine learning questions and coding questions.
Machine learning related questions:
- Discuss how to predict the price of a hotel given data from previous
years
- SVM formulation
- Logistic regression
- Regularization
- Cost function of neural network
- What is the difference between a generative and discriminative algorithm
- Relationship between kernel trick and dimension augmentation
- What is PCA projection and why it can be solved by SVD
- Bag of Words (BoW) feature
- Nonlinear dimension reduction (Isomap, LLE)
- Supervised methods for dimension reduction
- What is naive Bayes
- Stochastic gradient / gradient descent
- How to predict the age of a person given everyone’s phone call history
- Variance and Bias (a very popular question, watch Andrew’s class)
- Practices: When to collect more data / use more features / etc. (watch
Andrew’s class)
- How to extract features of shoes
- During linear regression, when using each attribute (dimension)
independently to predict the target value, you get a positive weight for
each attribute. However, when you combine all attributes to predict, you get
some large negative weights, why? How to solve it?
- Cross Validation
- Reservoir sampling (a minimal sketch follows after this list)
- Explain the difference among decision tree, bagging and random forest
- What is collaborative filtering
- How to compute the average of a data stream (very easy, different from
moving average)
- Given a coin, how to pick 1 person from 3 persons with equal probability.
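For the reservoir sampling item above, here is a minimal Python sketch of the classic Algorithm R for drawing k items uniformly at random from a stream of unknown length; the function name and the sample size are my own choices, purely for illustration.

```python
import random

def reservoir_sample(stream, k):
    """Keep k items chosen uniformly at random from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)     # uniform index in [0, i]
            if j < k:
                reservoir[j] = item      # keep the new item with probability k / (i + 1)
    return reservoir

# Example: pick 3 elements from a stream we can only traverse once
print(reservoir_sample(iter(range(100)), 3))
```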
Coding related questions:
- Leetcode: Number of Islands
- Given the start time and end time of each meeting, compute the smallest number of rooms needed to host all the meetings; in other words, pack as many meetings into the same room as possible (see the sketch after this list)
- Given an array of integers, find the two largest products formed by any 3 elements (O(n log n))
- LeetCode: Reverse words in a sentence (follow up: do it in-place)
- LeetCode: Word Pattern
- Evaluate a formula represented as a string, e.g., “3 + (2 * (4 - 1) )”
- Flip a binary tree
- What is the underlying data structure of a Java HashMap? (It is a hash table: an array of buckets with linked lists for collisions, and since Java 8 long chains become red-black trees; the sorted-key map backed by a BST is TreeMap.)
- Find the lowest common ancestor of two nodes in a binary tree
- Given a huge file in which each line is a person's name, sort the names using a single computer with little memory but large disk space
- Design a data structure to quickly compute the row sum and column sum of
a sparse matrix
- Design a wrapper class for a pointer to make sure this pointer will
always be deleted even if an exception occurs in the middle
- My Google onsite questions: http://www.mitbbs.com/article_t/JobHunting/33106617.html
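For the meeting rooms question above, one common approach is to sort meetings by start time and keep a min-heap of end times, one entry per room in use; the heap size at the end is the answer. A minimal Python sketch, with made-up example intervals:

```python
import heapq

def min_meeting_rooms(intervals):
    """Smallest number of rooms so that no two meetings in a room overlap."""
    if not intervals:
        return 0
    intervals = sorted(intervals)              # sort by start time
    end_times = []                             # min-heap of end times of rooms in use
    for start, end in intervals:
        if end_times and end_times[0] <= start:
            heapq.heapreplace(end_times, end)  # reuse the room that frees up earliest
        else:
            heapq.heappush(end_times, end)     # all rooms busy: open a new one
    return len(end_times)

print(min_meeting_rooms([(0, 30), (5, 10), (15, 20)]))  # -> 2
```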
A few thoughts from my interviews:
The most important thing, I think, is mindset. When you have been searching for months without an offer while others keep posting their offers on the board, you are bound to feel anxious, even hopeless. I was the same; those offer-report threads were pure negative energy for me, and I refused to open them. At moments like that, tell yourself two words: keep going. I believe opportunity favors those who persist, and the effort will eventually pay off.
There are still plenty of machine learning positions, and Chinese applicants with strong math backgrounds have a clear advantage, so it is well worth a try. Some posts claim these positions mainly hire PhDs, and there may be some truth to that, but judging from most of the interview questions I ran into, I personally think either an MS or a PhD is fine. For an MS, it helps to have some project experience from school.
Work carefully through Andrew Ng's machine learning course on Coursera; it covers many of the concepts and questions that show up in interviews. The treatment is fairly introductory, but it helps a lot with interviews. You can play the videos at 1.5x speed to save time.
If some concepts or algorithms are unclear, or you want to deepen your understanding, look for other lecture notes and videos, for example on Coursera, Wikipedia, or the machine learning course materials of top schools.
Before the job search, figure out where you fit: what you want to do, what you are good at, and how to make yourself competitive, and then shore up your weaknesses with your strengths (rather than simply playing to your strengths and hiding your weaknesses).
Data scientist roles do not seem as brutal on coding as software engineer roles, but even so, do not slack off on coding practice.
The four areas I personally think you need to be familiar with before interviewing for machine-learning-related positions:
Classification:
Logistic regression
Neural Net (classification/regression)
SVM
Decision tree
Random forest
Bayesian network
Nearest neighbor classification
Regression:
Neural Net regression
Linear regression
Ridge regression (linear regression with an L2 regularizer)
Lasso regression
Support Vector Regression
Random forest regression
Partial Least Squares
Clustering:
K-means
EM
Mean-shift
Spectral clustering
Hierarchical clustering
Dimension Reduction:
PCA (see the SVD sketch after this list)
ICA
CCA
LDA
Isomap
LLE
Neural Network hidden layer
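As a refresher for the PCA entries above (and the earlier question about why PCA can be solved by SVD), here is a minimal NumPy sketch; the function and variable names are mine, for illustration only. Centering the data makes the right singular vectors of the data matrix coincide with the eigenvectors of the covariance matrix, which is exactly the PCA basis.

```python
import numpy as np

def pca_via_svd(X, n_components):
    """Project the rows of X onto the top principal components using SVD.

    For centered data Xc, the covariance is Xc.T @ Xc / (n - 1) = V diag(S**2 / (n - 1)) V.T,
    so the right singular vectors V are the principal directions.
    """
    Xc = X - X.mean(axis=0)                        # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                 # top principal directions
    return Xc @ components.T                       # low-dimensional projection

X = np.random.randn(200, 5)
print(pca_via_svd(X, 2).shape)                     # (200, 2)
```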
Machine Learning & Hadoop in Next-Generation Fraud Detection
Takeaway: Fraud detection has always been a priority in the banking industry, but with the addition of modern tools like Hadoop and machine learning, it can be more accurate than ever.

Fraud detection and prevention is a real pain point for the banking industry. The industry spends millions on technologies to reduce fraud, but most of the current mechanisms are based on static historical data and rely on pattern and signature matching against that data, so first-time fraudulent acts are very difficult to detect and can cause heavy financial losses. The only solution is a mechanism based on both historical and real-time data, and this is where the Hadoop platform and machine learning come into play.
Fraud and Banks
Banks are especially vulnerable to fraud, which is a major cause of their losses; one estimate puts the cost of bank fraud at more than $1.7 trillion per year. Banks spend heavily on fraud prevention, yet that spending has not translated into adequate protection, and the technologies they are equipped with today simply aren't powerful enough. Big data and machine learning, however, can help revamp the current systems and bring fraud down to an all-time low.
Current approaches to fraud detection have the following limitations:
Overlooking First-Time Fraud
The applications banks currently use to detect scams are dated. A bank builds a complex model from previous instances of fraud and then uses it to check the authenticity and legitimacy of every transaction. Because the model is static and relies on older records and transaction signatures, first-time fraud, which has no matching signature, is easily overlooked. The model is also not very accurate, since only a small portion of the available fraud records goes into building it, so many frauds slip through undetected.
Older Algorithms
Current fraud prevention methods require the model to be updated with the most recent instances of fraud. In practice, however, these models are often refreshed only annually because of the cost and time involved, and deriving an accurate model is hard in the first place. If the model is not updated regularly, new fraud patterns can go unnoticed until the next version is deployed, which may be months or even years later.
Distinguishing Frauds From Genuine Transactions
Banks also frequently misclassify genuine transactions as fraudulent. This harms the bank's reputation and frustrates legitimate customers whose transactions are canceled unnecessarily. Preventing it requires improving the accuracy of the current fraud prevention system. Understanding these limitations of the current algorithmic approach points the way to a newer, far more accurate system, and such a solution is urgently needed.
Solution to This Problem
A reliable and accurate solution is necessary to combat fraudulent transactions, while not hindering the genuine ones. This solution must be able to detect a wide variety of fraud types as each transaction takes place, and all in real time. The results must also be accurate so that legitimate transactions are not interrupted. But the real question is how the banking industry will reform its current fraud detection methods. How will it build a fraud detection application which is both efficient and fast, and can even stop those false positives that can disrupt the activities of genuine customers? The solution lies in machine learning based on big data platforms like Hadoop.
What Is Machine Learning?
In this context, machine learning is what emerges when big data analysis is integrated into fraud detection: the system learns both from large data resources and from its own earlier experience in the field. This helps the application detect and intercept fraudulent transactions more easily, and even learn to recognize a specific kind of fraud so that it can be caught more quickly in the future.
How Can Machine Learning in Hadoop Prevent Fraud?
Processing large amounts of data accurately used to be a herculean task, but with the advent of big data, several faster and more powerful data processing platforms have emerged. One of the most powerful is Hadoop, whose MapReduce programming model lets it process huge volumes of data in parallel, quickly and very cheaply.
Because Hadoop can process so much data at once, it can churn through all of the older transaction records and signatures and build a far more accurate mathematical model. The same transaction details can be mined for new signatures, which helps the bank intercept first-time fraud. The question that remains is which tool to use for processing the data and building such a model.
Tools for Preventing Bank Fraud
With bank fraud on the rise, a good fraud management application is badly needed. One such tool is Skytree, a dedicated machine learning platform that promises high accuracy and performance even when processing very large banking transaction records. It runs on Hadoop clusters, which makes near-real-time big data processing possible, and it supports a wide variety of machine learning procedures, both supervised and unsupervised. With these, Skytree can stop fraudulent transactions using an advanced model and even intercept first-time fraud by flagging suspicious transactions. It can automatically select the most informative data to build a highly accurate model, and because it analyzes large amounts of data easily, the model is also easier to keep up to date.
Cons of Machine Learning
Machine learning may be a very powerful solution for fraud detection, but it brings challenges of its own. The concept is closely tied to artificial intelligence, and the prospect of machines making decisions for us raises ethical questions. However, there is little need to worry: the application works for us and makes its best decisions under the supervision of a human employee. Rest assured, machine learning will produce smarter fraud prevention techniques and help prevent financial losses in the future.
Conclusion
The best fraud management application must be powerful, fast and accurate, and must adapt to a variety of situations. To achieve this, the application must be able to churn through transaction details and signatures while keeping its database updated with the newest fraud types. A platform built on Hadoop is well placed to do this: it is fast, supports many different kinds of machine learning algorithms, and can detect fraud in real time, stopping many incidents before they complete. With a dedicated machine learning application at its side, a bank can come close to being invulnerable to fraud.
Machine learning and big data know it wasn’t you who just swiped your credit card
You’re sitting at home minding your own business when you get a call from your credit card’s fraud detection unit asking if you’ve just made a purchase at a department store in your city. It wasn’t you who bought expensive electronics using your credit card – in fact, it’s been in your pocket all afternoon. So how did the bank know to flag this single purchase as most likely fraudulent?
Credit card companies have a vested interest in identifying financial transactions that are illegitimate and criminal in nature. The stakes are high. According to the Federal Reserve Payments Study, Americans used credit cards to pay for 26.2 billion purchases in 2012. The estimated loss due to unauthorized transactions that year was US$6.1 billion. The federal Fair Credit Billing Act limits the maximum liability of a credit card owner to $50 for unauthorized transactions, leaving credit card companies on the hook for the balance. Obviously fraudulent payments can have a big effect on the companies' bottom lines. The industry requires any vendors that process credit cards to go through security audits every year. But that doesn’t stop all fraud.
In the banking industry, measuring risk is critical. The overall goal is to figure out what’s fraudulent and what’s not as quickly as possible, before too much financial damage has been done. So how does it all work? And who’s winning in the arms race between the thieves and the financial institutions?
Gathering the troops
From the consumer perspective, fraud detection can seem magical. The process appears instantaneous, with no human beings in sight. This apparently seamless and instant action involves a number of sophisticated technologies in areas ranging from finance and economics to law to information sciences.
Of course, there are some relatively straightforward and simple detection mechanisms that don’t require advanced reasoning. For example, one good indicator of fraud can be an inability to provide the correct zip code affiliated with a credit card when it’s used at an unusual location. But fraudsters are adept at bypassing this kind of routine check – after all, finding out a victim’s zip code could be as simple as doing a Google search.
Traditionally, detecting fraud relied on data analysis techniques that required significant human involvement. An algorithm would flag suspicious cases to be closely reviewed ultimately by human investigators who may even have called the affected cardholders to ask if they’d actually made the charges. Nowadays the companies are dealing with a constant deluge of so many transactions that they need to rely on big data analytics for help. Emerging technologies such as machine learning and cloud computing are stepping up the detection game.

Learning what’s legit, what’s shady
Simply put, machine learning refers to self-improving algorithms: predefined processes, following specific rules, that a computer carries out. The computer starts with a model and then trains it through trial and error. It can then make predictions, such as the risk associated with a financial transaction.
A machine learning algorithm for fraud detection first needs to be trained on the normal transaction data of lots and lots of cardholders. Transaction sequences are one example of this kind of training data. A person may typically pump gas once a week, go grocery shopping every two weeks, and so on. The algorithm learns that this is a normal transaction sequence.
After this fine-tuning process, credit card transactions are run through the algorithm, ideally in real time. It then produces a probability number indicating the possibility of a transaction being fraudulent (for instance, 97%). If the fraud detection system is configured to block any transactions whose score is above, say, 95%, this assessment could immediately trigger a card rejection at the point of sale.
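To make the score-and-threshold step concrete, here is a minimal, hypothetical Python sketch: a pre-trained model turns a handful of transaction features into a fraud probability, and anything above a configurable cutoff is rejected. The feature names, the hand-set weights, and the 0.95 threshold are assumptions for illustration, not the actual system described here.

```python
import math

# Hypothetical, hand-set weights for illustration only; a real issuer would
# learn them from millions of labeled transactions.
WEIGHTS = {"amount_zscore": 1.8, "new_merchant": 0.9, "foreign_ip": 1.2, "odd_hour": 0.7}
BIAS = -3.0
REJECT_THRESHOLD = 0.95          # block transactions scoring above this

def fraud_probability(features):
    """Logistic score in [0, 1]; higher means more likely fraudulent."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def decide(features):
    p = fraud_probability(features)
    return ("REJECT" if p > REJECT_THRESHOLD else "APPROVE"), round(p, 3)

# A large late-night purchase at a new merchant from an unusual IP address
print(decide({"amount_zscore": 3.0, "new_merchant": 1, "foreign_ip": 1, "odd_hour": 1}))
```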
The algorithm considers many factors to qualify a transaction as fraudulent: trustworthiness of the vendor, a cardholder’s purchasing behavior including time and location, IP addresses, etc. The more data points there are, the more accurate the decision becomes.
This process makes just-in-time or real-time fraud detection possible. No person can evaluate thousands of data points simultaneously and make a decision in a split second.
Here’s a typical scenario. When you go to a cashier to check out at the grocery store, you swipe your card. Transaction details such as time stamp, amount, merchant identifier and membership tenure go to the card issuer. These data are fed to the algorithm that’s learned your purchasing patterns. Does this particular transaction fit your behavioral profile, consisting of many historic purchasing scenarios and data points?

The algorithm knows right away if your card is being used at the restaurant you go to every Saturday morning – or at a gas station two time zones away at an odd time such as 3:00 a.m. It also checks if your transaction sequence is out of the ordinary. If the card is suddenly used for cash-advance services twice on the same day when the historic data show no such use, this behavior is going to up the fraud probability score. If the transaction’s fraud score is above a certain threshold, often after a quick human review, the algorithm will communicate with the point-of-sale system and ask it to reject the transaction. Online purchases go through the same process.
In this type of system, heavy human interventions are becoming a thing of the past. In fact, they could actually be in the way since the reaction time will be much longer if a human being is too heavily involved in the fraud-detection cycle. However, people can still play a role – either when validating a fraud or following up with a rejected transaction. When a card is being denied for multiple transactions, a person can call the cardholder before canceling the card permanently.
Computer detectives, in the cloud
The sheer number of financial transactions to process is overwhelming, truly in the realm of big data. But machine learning thrives on mountains of data: more information actually increases the accuracy of the algorithm and helps eliminate false positives, which are triggered by suspicious-looking transactions that are really legitimate (for instance, a card used at an unexpected location). Too many alerts are as bad as none at all.
It takes a lot of computing power to churn through this volume of data. For instance, PayPal processes more than 1.1 petabytes of data for 169 million customer accounts at any given moment. This abundance of data – one petabyte, for instance, is more than 200,000 DVDs' worth – has a positive influence on the algorithms' machine learning, but can also be a burden on an organization's computing infrastructure.
Enter cloud computing. Off-site computing resources can play an important role here. Cloud computing is scalable and not limited by the company’s own computing power.
Fraud detection is an arms race between good guys and bad guys. At the moment, the good guys seem to be gaining ground, helped by emerging innovations such as chip-and-PIN cards combined with encryption, machine learning, big data and, of course, cloud computing.
Fraudsters will surely continue trying to outwit the good guys and challenge the limits of the fraud detection system. Drastic changes in the payment paradigms themselves are another hurdle. Your phone is now capable of storing credit card information and can be used to make payments wirelessly – introducing new vulnerabilities. Luckily, the current generation of fraud detection technology is largely neutral to the payment system technologies.
How PayPal beats the bad guys with machine learning
As big cloud players roll out machine learning tools to developers, Dr. Hui Wang of PayPal offers a peek at some of the most advanced work in the field
When Amazon Web Services announced a new machine learning service for its cloud last week, it was a sort of mini-milestone. Now all four of the top clouds -- Amazon, Microsoft, Google, and IBM -- will offer developers the means to build machine learning into their cloud applications.
What sort of applications? As InfoWorld’s Andrew Oliver has observed, both machine learning and big data will eventually disappear as separate technology categories and insinuate themselves into many, many different aspects of computing.
Nonetheless, right now certain uses of machine learning stand out for their immediate payback.
Fraud detection is first among them, because it addresses an urgent problem that would be impractical to solve if machine learning didn't exist. To get a sense of how machine learning is combating fraud, I interviewed Dr. Hui Wang, senior director of risk sciences for PayPal. Wang holds a Ph.D. in statistics from UC Berkeley, and prior to her 11 years at PayPal conducted credit scoring research at Fair Isaac.
You can easily imagine why PayPal would be concerned about fraud, given the innumerable scams that have targeted PayPal users. As it turns out, however, PayPal has already ventured beyond fraud detection to address other areas of risk management, including “modern machine learning in the credit decision world,” which Wang says is a lot more complex -- in part due to regulatory requirements.
According to Wang, PayPal is a pioneer in risk management, although some advanced efforts are just now emerging from the lab. PayPal uses three types of machine learning algorithms for risk management: linear, neural network, and deep learning. Experience has shown PayPal that in many cases, the most effective approach is to use all three at once.
Running linear algorithms to detect fraud is an established practice, Wang says, where “we can separate the good from the bad with one straight line.” When this either/or categorization fails, however, things get more interesting:
We soon realized linear doesn’t work because the world, of course, is not linear. So instead of saying one line can separate the world of good from bad, let’s use multiple lines or curve the lines. Within that category, I guess people might be familiar with the neural network, [which imitates] how neurons work in the human world. Also, a lot of algorithms are tree-based, mimicking a human being when we have to make a judgment -- for example, we would say if it’s raining, I’ll take an umbrella.
Neural net algorithms were developed decades ago, but today’s modern computing infrastructure -- along with the enormous quantity of data we can now throw at those algorithms -- has increased neural net effectiveness by a magnitude. Wang says these advances have been essential for risk management:
We take trust very seriously. It’s our brand. We have to decide in a couple of hundred milliseconds whether this is a good person, [in which case] we will give him or her the best and the fastest and the most convenient experience. Or is it a potentially bad guy and we have to insert some friction? The recent progress on the infrastructure side makes the application of a neural network in a practical payment risk management world possible.
Quickly determining trustworthy customers and putting them in the express lane to a transaction is a key objective, Wang explains, using caching mechanisms to run relational queries and linear algorithms, among other techniques. The more sophisticated algorithms apply to customers who may be problematic, which slows down the system a bit as it acquires more data to perform in-depth screening.
This downstream process extends all the way to deep learning, which today also powers computer vision, speech recognition, and other applications. When I asked Wang for a layman’s explanation of the difference between neural nets and deep learning, she offered this explanation:
A neural net tries to mimic a human’s way of processing information. We take ABC and try to create a relationship among them, and we take a CDE and create another relationship, and then on a higher level abstract the intermediate mini-model. So it’s kind of mimicking the human thought process. But in deep learning you’re basically taking it to many, many layers. It’s not just ABCDE, there are like 3,000 features out there and then within that 3,000 there are a lot of mini-classes of features. They have all kinds of relationships and we’re just adding layers and layers of these intermediary mini-models or mini-abstractions of the information -- and in the end come up with the top level.
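To put a little structure behind that description, here is a tiny NumPy sketch of a forward pass through stacked layers; the layer sizes, the random (untrained) weights, and the 3,000-feature input are assumptions chosen only to echo the quote, not PayPal's model.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, layers):
    """Each (W, b) pair is one layer of 'intermediary mini-abstractions' stacked on the last."""
    h = x
    for W, b in layers[:-1]:
        h = relu(h @ W + b)                              # hidden layers build higher-level features
    W_out, b_out = layers[-1]
    return 1.0 / (1.0 + np.exp(-(h @ W_out + b_out)))    # final score in [0, 1]

rng = np.random.default_rng(0)
dims = [3000, 256, 64, 1]                                # many raw features funneled through deeper layers
layers = [(rng.normal(scale=0.01, size=(m, n)), np.zeros(n)) for m, n in zip(dims, dims[1:])]
print(forward(rng.normal(size=(1, 3000)), layers).shape)  # (1, 1)
```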
Wang emphasizes that you need large quantities of data to support these complex neural network structures. PayPal itself collects gargantuan amounts of data about buyers and sellers, including their network information, machine information, and financial data. The deep learning beast is well fed.
But again, PayPal does not use deep learning in isolation. It applies all three together: linear, neural network, and deep learning algorithms. Wang explains why:
Let’s take a linear algorithm. You might think it’s outdated, but it still potentially catches something the nonlinear algorithm might not be able to. So in order to get the best out of all [three], we “ensemble” them together. We have a “voting committee.” One is linear and one is nonlinear and we just ask them: What is your opinion on this file? Then we take their vote and eventually ensemble them together for our final assessment … [It’s like] taking a lot of doctors and listening to all of them. With that kind of community-based voting, hopefully something better or more accurate will come out.
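A minimal sketch of that "voting committee" idea: several independently built scorers each give an opinion, and the final assessment is a weighted average of their votes. The three toy scorers, the feature names, and the equal weights are stand-ins for illustration, not PayPal's production ensemble.

```python
def linear_score(tx):        # stand-in for a linear model
    return min(1.0, 0.1 * tx["amount_zscore"])

def neural_score(tx):        # stand-in for a neural-network model
    return 0.9 if tx["odd_hour"] and tx["new_merchant"] else 0.2

def deep_score(tx):          # stand-in for a deep-learning model
    return 0.8 if tx["foreign_ip"] else 0.1

def committee_score(tx, weights=(1/3, 1/3, 1/3)):
    """Weighted average of the committee members' votes."""
    votes = (linear_score(tx), neural_score(tx), deep_score(tx))
    return sum(w * v for w, v in zip(weights, votes))

tx = {"amount_zscore": 4.0, "odd_hour": True, "new_merchant": True, "foreign_ip": True}
print(round(committee_score(tx), 3))   # 0.7 -> high enough to escalate for review
```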
Wang says she is proud to be managing the data science team at PayPal, which is on the forefront of developing machine learning and data mining technology. Because her team is so advanced, particularly in the practical application of deep learning, I couldn’t resist asking her what she thought about widely publicized warnings from Stephen Hawking, Elon Musk, and Bill Gates regarding the potential dangers of artificial intelligence in the future:
I never worry that these machines will replace humans. Yes, we can add layers, but you can talk to any machine learning scientist and they will say that the algorithm is important, but at the end of the day what really makes the difference is that a machine cannot find data automatically … There is so much data, so much variety, but the flip side is: What is useful? We still rely on human oversight to decide what ingredients to pump into the machine.
There’s a practical lesson in that statement: Now and in the future, machine learning depends not only on big data, but on the right data. Cloud infrastructure and integration present abundant compute power for deep learning -- as well as access to unthinkably large data sets across a potentially unlimited number of domains.
What we refer to as big data and machine learning today will tomorrow simply be integrated into the fabric of computing. As the major cloud providers open up those capabilities to all developers, the stage is set for a new wave of applications that will be much more intelligent than before.