Sunday, July 31, 2016

Examples of algorithms


Statistical Arbitrage --- two highly correlated time series diverge for 900 ms; the algorithm bets on the spread reverting back.

Iceberg behavior --- hiding a large order as scrambled noise, disguising the trend.

Predatory behavior --- actively and strategically luring others onto a pre-determined path for the kill.

Stuffing/distraction behavior --- keeping others occupied with useless trial balloons to gain a 50 ms advantage.

Augmented Intelligence --- interaction between human and algorithm/math: slow, visual, plausible human insight guiding a brute-force algorithm.
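A minimal sketch of the statistical-arbitrage idea above, using a rolling z-score on the spread between two correlated series. The window size and the 2-sigma trigger are illustrative assumptions, not anything from the notes:

```python
import numpy as np

def spread_zscore(a, b, window=20):
    """Z-score of the spread between two correlated price series.

    A large |z| flags a divergence; a stat-arb strategy would bet on the
    spread reverting toward its rolling mean.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    spread = a - b
    mean = spread[-window:].mean()
    std = spread[-window:].std()
    return (spread[-1] - mean) / std if std > 0 else 0.0

# Two series that track each other, until series `a` spikes away on the last tick:
a = [100 + 0.1 * i for i in range(30)]
b = [100 + 0.1 * i for i in range(30)]
a[-1] += 2.0
z = spread_zscore(a, b)
print(z > 2)   # divergence flagged -> short a, long b, wait for reversion
```

In a real 900 ms setting the same computation would run on tick data with a much shorter window.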

Wednesday, July 27, 2016

76 Data Science Interview Questions


https://www.dezyre.com/article/100-data-science-interview-questions-and-answers-general-for-2016/184



1) How would you create a taxonomy to identify key customer trends in unstructured data?

2)         Python or R – Which one would you prefer for text analytics?

3)         Which technique is used to predict categorical responses?

4)         What is logistic regression? Or State an example when you have used logistic regression recently.

5)         What are Recommender Systems?

6)         Why does data cleaning play a vital role in analysis?

7)         Differentiate between univariate, bivariate and multivariate analysis.

8)         What do you understand by the term Normal Distribution?

9)         What is Linear Regression?

10)       What is Interpolation and Extrapolation?

11)       What is power analysis?

12)      What is K-means? How can you select K for K-means?

13)       What is Collaborative filtering?

14)       What is the difference between Cluster and Systematic Sampling?

15)       Are expected value and mean value different?

16)       What does P-value signify about the statistical data?

17)  Do gradient descent methods always converge to same point?

18)  What are categorical variables?

19)       A test has a true positive rate of 100% and a false positive rate of 5%. The condition the test identifies has a 1/1000 rate in the population. Given a positive test, what is the probability of having the condition?

20)       How can you make data normal using the Box-Cox transformation?

21)       What is the difference between Supervised Learning and Unsupervised Learning?

22) Explain the use of Combinatorics in data science.

23) Why is vectorization considered a powerful method for optimizing numerical code?

24) What is the goal of A/B Testing?

25)       What is an Eigenvalue and Eigenvector?

26)       What is Gradient Descent?

27)       How can outlier values be treated?

1) Change the value to bring it within a range.

2) To just remove the value.

28)       How can you assess a good logistic model?


29)       What are various steps involved in an analytics project?

30) How can you iterate over a list and also retrieve element indices at the same time?

31)       During analysis, how do you treat missing values?

33)       Can you use machine learning for time series analysis?


34)       Write a function that takes in two sorted lists and outputs a sorted list that is their union.

35)       What is the difference between Bayesian Inference and Maximum Likelihood Estimation (MLE)?

36)       What is Regularization and what kind of problems does regularization solve?

37)       What is multicollinearity and how can you overcome it?

38)        What is the curse of dimensionality?

39)        How do you decide whether your linear regression model fits the data?

40)       What is the difference between squared error and absolute error?

41)       What is Machine Learning?

42) How are confidence intervals constructed and how will you interpret them?

43) How will you explain logistic regression to an economist, a physical scientist and a biologist?

44) How can you overcome Overfitting?

45) Differentiate between wide and tall data formats?

46) Is Naïve Bayes bad? If yes, in what respects?

47) How would you develop a model to identify plagiarism?

48) How will you define the number of clusters in a clustering algorithm?

49) Is it better to have too many false negatives or too many false positives?

50) Is it possible to perform logistic regression with Microsoft Excel?

51)  What do you understand by Fuzzy merging? Which language will you use to handle it?

52) What is the difference between skewed and uniform distribution?

53) You created a predictive model of a quantitative outcome variable using multiple regression. What steps would you follow to validate the model?

54) What do you understand by Hypothesis in the context of Machine Learning?

55) What do you understand by Recall and Precision?

56) How will you find the right K for K-means?

57) Why does L1 regularization cause parameter sparsity whereas L2 regularization does not?

58) How can you deal with different types of seasonality in time series modelling?

59) In experimental design, is it necessary to do randomization? If yes, why?

60) What do you understand by conjugate-prior with respect to Naïve Bayes?

61) Can you cite some examples where a false positive is more important than a false negative?

62) Can you cite some examples where a false negative is more important than a false positive?

63) Can you cite some examples where both false positive and false negatives are equally important?

64) Can you explain the difference between a Test Set and a Validation Set?


65) What makes a dataset gold standard?

66) What do you understand by statistical power of sensitivity and how do you calculate it?

67) What is the importance of having a selection bias?

68) Give some situations where you will use an SVM over a RandomForest Machine Learning algorithm and vice-versa.

SVM and Random Forest are both used in classification problems.


69) What do you understand by feature vectors?

70) How do data management procedures like missing data handling make selection bias worse?

71) What are the advantages and disadvantages of using regularization methods like Ridge Regression?

72) What do you understand by long and wide data formats?

73) What do you understand by outliers and inliers? What would you do if you find them in your dataset?

74) Write a program in Python which takes the diameter and weight of a coin as input and produces the money value of the coin as output.

75) What are the basic assumptions to be made for linear regression?

76) Can you write the formula to calculate R-square?
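Question 19 above can be worked out directly with Bayes' theorem; the perhaps surprising answer is about 2%:

```python
# Question 19: P(condition | positive test) via Bayes' theorem.
tpr = 1.0            # P(positive | condition), true positive rate
fpr = 0.05           # P(positive | no condition), false positive rate
prevalence = 1 / 1000

p_positive = tpr * prevalence + fpr * (1 - prevalence)
p_condition_given_positive = tpr * prevalence / p_positive
print(round(p_condition_given_positive, 4))   # -> 0.0196, i.e. about 2%
```

Despite the perfect true positive rate, the 5% false positives swamp the rare 1/1000 condition.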


Monday, July 25, 2016

Python interview question 2


(1) What is the output of the following?
def multipliers():
    return [lambda x: i * x for i in range(4)]

print([m(2) for m in multipliers()])
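The answer, plus the standard default-argument fix: all four lambdas close over the same loop variable i, which is 3 by the time they are called.

```python
def multipliers():
    # every lambda closes over the SAME variable i, which ends at 3
    return [lambda x: i * x for i in range(4)]

print([m(2) for m in multipliers()])        # -> [6, 6, 6, 6]

def multipliers_fixed():
    # i=i binds the current value as a default argument on each iteration
    return [lambda x, i=i: i * x for i in range(4)]

print([m(2) for m in multipliers_fixed()])  # -> [0, 2, 4, 6]
```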


(2) What does the following class-attribute code print?

class Parent(object):
    x = 1

class Child1(Parent):
    pass

class Child2(Parent):
    pass

print(Parent.x, Child1.x, Child2.x)
Child1.x = 2
print(Parent.x, Child1.x, Child2.x)
Parent.x = 3
print(Parent.x, Child1.x, Child2.x)
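Running it shows how attribute lookup falls back to the base class when a subclass has no attribute of its own:

```python
class Parent(object):
    x = 1

class Child1(Parent):
    pass

class Child2(Parent):
    pass

print(Parent.x, Child1.x, Child2.x)  # 1 1 1 -- both children find x on Parent
Child1.x = 2                         # creates a NEW attribute on Child1 only
print(Parent.x, Child1.x, Child2.x)  # 1 2 1
Parent.x = 3                         # Child2 still resolves x via Parent
print(Parent.x, Child1.x, Child2.x)  # 3 2 3 -- Child1 keeps its own x
```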


(3)  lst = ['a', 'b', 'c', 'd', 'e']
print(lst[10:])   # slicing past the end returns [], it does not raise


(4)
d = DefaultDict()   # NameError as written -- the standard class is collections.defaultdict
d['florp'] = 127
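For reference, the standard-library version behaves like this:

```python
from collections import defaultdict

# defaultdict needs a factory: a plain dict raises KeyError on a missing
# key, while defaultdict(int) supplies 0 on first access instead.
d = defaultdict(int)
d['florp'] = 127
print(d['florp'])    # -> 127
print(d['missing'])  # -> 0, created on first access
```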

(5) How is Python interpreted? How is memory managed in Python?

(6) What is the difference between list and tuple?

(7) What are dict and list comprehensions?

(8) Are arguments passed by value or by reference?

(9) What built-in types does Python provide?

(10) What is module and package in Python?

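A quick sketch for question (7), dict and list comprehensions:

```python
words = "words make a sentence".split()

lengths = [len(w) for w in words]     # list comprehension
by_word = {w: len(w) for w in words}  # dict comprehension
print(lengths)                # -> [5, 4, 1, 8]
print(by_word['sentence'])    # -> 8
```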

Sunday, July 24, 2016

Python Interview Questions


(1) What is Python?
     Bill Gates -- put DOS on every PC; close to the hardware, a low-level-language era
     Steve Jobs -- translate human activity into apps; the C language era
     Zuckerberg -- the one killer app; HTML/mobile native.

 Python -- high level, fits data science: if/while/for, functions, classes, iterators.
.py runs on any OS/hardware by being translated into a low-level language.

(2) Write a function def print_directory_contents(sPath):

import os

def print_directory_contents(sPath):
    for sChild in os.listdir(sPath):
        sChildPath = os.path.join(sPath, sChild)
        if os.path.isdir(sChildPath):
            print_directory_contents(sChildPath)
        else:
            print(sChildPath)
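An equivalent traversal using os.walk, which handles the recursion internally; written as a generator so the caller decides whether to print or collect (the function name here is my own):

```python
import os

def iter_directory_contents(path):
    # os.walk does the recursion for us, yielding a
    # (dirpath, dirnames, filenames) tuple per directory
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            yield os.path.join(dirpath, name)

# usage: for p in iter_directory_contents('.'): print(p)
```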


(3) How do you keep track of different versions of your code?

(4) What is the output of the following?
def f(x,l=[]):
    for i in range(x):
        l.append(i*i)
    print(l)
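A variant that returns the list makes the sharing easy to see, together with the usual None-sentinel fix (the default list is created once, at def time, and shared across calls that do not pass l):

```python
def f(x, l=[]):
    # the default list is created ONCE and reused across calls
    for i in range(x):
        l.append(i * i)
    return l

print(f(2))             # -> [0, 1]
print(f(3, [3, 2, 1]))  # -> [3, 2, 1, 0, 1, 4]  (a fresh list was passed in)
print(f(3))             # -> [0, 1, 0, 1, 4]  the default kept the earlier values!

def f_fixed(x, l=None):
    if l is None:
        l = []          # the usual fix: a new list on every call
    for i in range(x):
        l.append(i * i)
    return l
```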

(5) Describe Python's garbage collection mechanism in brief.

(6) What is the profiling package?

import cProfile
import random

def f2(lIn):
    l1 = [i for i in lIn if i < 0.5]
    l2 = sorted(l1)
    return [i * i for i in l2]

lIn = [random.random() for i in range(100000)]
cProfile.run('f2(lIn)')   # f1 and f3 were alternative implementations being compared
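Besides cProfile, timeit is handy for micro-benchmarking a variant like the pipeline above (the one-expression form here is just an illustrative rewrite of f2):

```python
import random
import timeit

def filter_sort_square(lIn):
    # the f2 pipeline from the notes, collapsed into one expression
    return [i * i for i in sorted(i for i in lIn if i < 0.5)]

lIn = [random.random() for _ in range(10000)]
elapsed = timeit.timeit(lambda: filter_sort_square(lIn), number=10)
print(elapsed > 0)   # elapsed = wall-clock seconds for 10 runs
```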

iterable and iterator


Python has a concept similar to C#'s IEnumerable/IEnumerator (plus some Rx flavor).

Comprehension -- [expr(i) for i in iterable]:
words = "words make a sentence.".split()
[len(w) for w in words]                    # list comprehension
{factorial(x) for x in range(10)}          # set comprehension; from math import factorial
d_flip = {v: k for k, v in d.items()}      # dict comprehension, flipping d = {k: v}
{w[0]: w for w in words}                   # dict keyed by first letter
{os.path.realpath(p): os.stat(p).st_size for p in glob.glob("*.py")}
sum(x for x in range(1, 100) if is_prime(x))   # generator expression
from pprint import pprint as pp            # pp(d_flip) for readable output

Iterator protocol: itr = iter(itra); itm = next(itr)
def first(itra):
    try:
        return next(iter(itra))
    except StopIteration:
        return None    # e.g. first(set()) on an empty set gives None

from math import sqrt
def is_prime(x):
    if x < 2:
        return False
    for i in range(2, int(sqrt(x)) + 1):   # +1, not +2: +2 would wrongly reject x=2
        if x % i == 0:
            return False
    return True

from itertools import islice, count
p1000 = [x for x in range(1, 1000) if is_prime(x)]
i = list(islice(p1000, 5, 10))   # islice = skip + take

# count() is an infinite lazy sequence -- careful: realizing it with a
# [list comprehension] overruns memory; a (generator expression) is fine.
slice_inf = islice((x for x in count() if is_prime(x)), 1000)

# pprint does not realize a generator; convert with list() first.
print(list(slice_inf))    # or print(sum(slice_inf)) -- fast and memory-efficient

print(any(is_prime(x) for x in range(4495, 4500)))
print(all(n == n.title() for n in ["London", "Boston ma"]))   # "ma" -> False

from itertools import chain
s1 = [1, 2, 3, 4]; s2 = [5, 6, 7, 8]
print(list(zip(s1, s2, s1)))     # tuples of parallel elements
print(list(chain(s1, s2, s1)))   # one flat concatenated sequence
print([max(x) for x in zip(s1, s2, s1)])

Fibonacci in Python -- lazy yield does not cost memory
def fib(x):
    a, b = 1, 1
    while a < x:
        yield a
        a, b = b, a + b

Generator = lazy comprehension, itself an iterable
g = (x for x in range(1000000))   # must use (), not []: stays lazy, read-only, efficient
l = list(g)   # realization costs RAM (~45 MB in the author's test); a generator is
              # single-use -- after list(g), a second list(g) comes back empty
sum(x for x in ...)   # streaming consumption keeps memory low
Stateful generator -- e.g. a counter or a seen=set() kept in memory during iteration:
def take(n, itra): count the items and stop once n have been yielded
def distinct(itra): seen = set(); skip items already in seen, else yield and add
pipeline: for i in take(10, distinct(itra)): ...
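The take/distinct pipeline sketched above, written out and runnable:

```python
def take(n, iterable):
    # stateful generator: a counter stops it after n items
    counter = 0
    for item in iterable:
        if counter == n:
            return
        counter += 1
        yield item

def distinct(iterable):
    # stateful generator: seen remembers what was already yielded
    seen = set()
    for item in iterable:
        if item in seen:
            continue
        yield item
        seen.add(item)

print(list(take(3, distinct([1, 1, 2, 2, 3, 3, 4]))))   # -> [1, 2, 3]
```

Both stages are lazy, so the pipeline never materializes the full input.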

Python fast track for Interview


from math import *; import random   # Python 3.4.1 / IDLE
while int(num) > 10:   # ':' opens a code block
    print("too big")   # clean code by indentation
    num = input()
# a code block ends by dedenting -- indentation, not braces
message = "see {0} and {1}"; print(message.format(1, random.randint(1, 100)))
def load_data(): return "1"
load_data()   # note the difference between defining (def ...:) and calling (())
if: / elif: / else:   for c in w:   while True: break   try: / except: / finally:
list: c = ["Red", 'red', "Blue"]; c.sort(); c.reverse(); c.append("R"); c.count("R"); len(c)
w = "_" * 6; for c in w: print(c, " ", end="")   str-as-list: list("cheese"); lst = []
for i in range(1, 100):   "K" in "Kol"   "this"[:1] is 't', "this"[2:] is 'is'
tuple vs list: () vs [] -- both hold mixed types: c = [1, 'cat']; t = 1, "cat"; t = (1, "cat")
list: append, c[0] = 2; tuple: read-only; unpacking: i, s = t; def get_t(): return i, s
() helps return pairs -- c, i = get_cmd_item_tuple(); c = line[:1]; i = line[2:]; return i, c
if c == "d" and i in lst: lst.remove(i)   # tuples are faster, lists have more functionality
[], (), {}, {:} -- list, tuple, set, dict; a set is an unordered collection: no index, add() not append()
dict is key-indexed: d[k] = v, no add/append; lists append, sets add; d = {} is a dict, not a set
s1 ^ s2, s1 & s2, s1 - s2, s1 | s2; len() and `in` work for []/()/{}/{:}; on a dict, `in` tests keys, not values
del lst[0]; del d[k]; s = {1, 2, 3, "str"}; s.pop() removes an arbitrary element; s.remove("str")
class Person: def run(self): ...   p = Person(); p.run()   def __init__(self, n=""):
self.dynamic_name = n; attributes are dynamic: p.dynamic_name = "allen"   >>> p1 shows a mem hex; id(p1) is an int; p1 is p2 compares identity
self._p is private by naming convention only; the raw representation (the toString() equivalent) is __repr__; type(c)
def __repr__(self): return "{0} {1}".format(Person, self.__dict__)   # __dict__ holds the instance attributes
file structure --- main.py, \package1\module1.py; a package directory must have __init__.py to initialize it
import package1; p = package1.module1.Person()   # long dotted names get unwieldy
>>> __name__ shows '__main__' when run directly; print(__name__) inside the module shows 'package1.module1'
from hr import person; p = person.Person()   # `from pkg import *` can be bad practice
from pkg import (m1, m2); from urllib.request import urlopen   # an import binds a symbol to a definition
from urllib.request import urlopen   # one way to avoid long dotted names
with urlopen("http://www.cnn.com") as story:   # the symbol is used without the pkg.module prefix
    for line in story:
        print(line.decode('utf8'))

def main(): """A docstring is seen in help(), at both def and module level."""
if __name__ == "__main__": main()   # the module can double as main.py, so define an entry point
sys.argv[0] is main.py, argv[1] the first param; #!/usr/bin/env python3 is the shebang
__file__; sys.argv = [sys.argv[0], 0] fakes a command line inside IDLE
try: import msvcrt except ImportError: print("import err")   # Alt-3/Alt-4 comment/uncomment in IDLE
except (ValueError, ArithmeticError) as e: print(e); print(str(e))
raise re-raises; pass; print("{}".format(str(e)), file=sys.stderr); raise ValueError()
p = os.getcwd(); os.chdir(p); os.mkdir("test")

r"c:\working\bash" is a raw string, like @"..." in C#;   list("test it") => ['t','e','s','t',' ','i','t']
str(b'\345') => "b'\\xe5'"; '\345' in a str literal is an octal escape (here 'å')
bytes like b'data' go over the network; convert with .encode()/.decode('utf-8')
value vs identity: == vs is -- p == q True for equal content, p is q False because id(p) != id(q)
optional params are evaluated only once: def f(m=[]): m.append('a') accumulates 'a','a','a' over f(), f(), f()
f(c=ctime) has the same problem, so use m=None as the default instead
LEGB rule --- gc = 0; def set(): gc = 3 does not change the global gc, since the assignment is local scope;
 def set2(): global gc; gc = 3 will change it
Help system --- import words; dir(words) => ['fetch_words', ...]; dir(itertools)
everything is an object, including modules and symbols
t = (39) is an int, t = (39,) is a tuple; unpacking: (a, (b, c)) = (1, (2, 3))
p = 1, 2, 3; type(p) shows tuple, not list
join vs split --- ';'.join(['a', 'b', 'c']).split(';')
"x of y of z".partition('of') => ('x ', 'of', ' y of z')   # splits only at the first match