Sunday, December 7, 2014

Apache Spark Day One - Installation

Install Spark on Mac OS 10.9

1. install Scala
On Mac OS, it is pretty easy with brew

brew install scala
(The initial installation failed, and it was fixed by installing hadoop)

2. go to the Spark directory and build it
./sbt/sbt clean assembly

3. go to the Mac settings and enable Remote Login
if all steps finished successfully, start the master with
./sbin/start-master.sh

The web UI is then available at localhost:8080

4. configuration
 - create spark-env.sh at conf/spark-env.sh
  try 4 slaves: at the end of the file, add "export SPARK_WORKER_INSTANCES=4" to start 4 workers
   after restarting, check the web UI; it should show the workers (you might be asked for your ssh credentials for each worker).
  

./sbin/stop-all.sh #stop the master and all slaves
./sbin/start-all.sh #start the master and all slaves

 - configure logging in conf/log4j.properties

5. now it is all set up; test a Python script with
./bin/spark-submit test_pythonscript.py
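
The post does not show what test_pythonscript.py contains, so here is a minimal sketch of what such a script could look like. The app name, the helper function and the choice of 4 partitions (to match the 4 workers above) are my own assumptions, not part of the original setup.

```python
"""test_pythonscript.py: a tiny job to confirm that the cluster runs Python code."""
from operator import add


def square(x):
    # Plain Python function; Spark ships it to the workers.
    return x * x


def main():
    try:
        # spark-submit puts pyspark on the path for us.
        from pyspark import SparkContext
    except ImportError:
        print("pyspark not found; launch this file with ./bin/spark-submit")
        return

    sc = SparkContext(appName="InstallCheck")   # hypothetical app name
    numbers = sc.parallelize(range(1, 101), 4)  # 4 partitions (assumed)
    total = numbers.map(square).reduce(add)
    print("sum of squares 1..100 = %d" % total)
    sc.stop()


if __name__ == "__main__":
    main()
```

If the cluster is healthy, the job should report 338350 and show up as a completed application in the web UI.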



Monday, October 13, 2014

Connect Python to AWS -- boto

Boto is the Python module for interfacing with Amazon Web Services. According to the docs, all the listed AWS features work with Python 2.6 and 2.7. I started with S3 on Python 2.7.

1. install boto (the easiest way is to use pip)

pip install boto

2. go to AWS and get an access key and a secret access key (set up both a user and a group)

3. run the example for testing
python s3_sample.py

Before running it, set your credentials in the source code:
 
AWS_ACCESS_KEY_ID = 'YOUR_ACCESS_KEY_ID'
AWS_SECRET_ACCESS_KEY = 'YOUR_SECRET_ACCESS_KEY'
s3 = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
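
Instead of pasting the keys into the source, one option is to read them from environment variables. This is only a sketch: the helper names load_credentials and connect are made up for illustration, and it assumes the two variables are exported in your shell.

```python
import os


def load_credentials(env=os.environ):
    """Return (access_key, secret_key) from the environment, or raise."""
    try:
        return env["AWS_ACCESS_KEY_ID"], env["AWS_SECRET_ACCESS_KEY"]
    except KeyError as missing:
        raise RuntimeError("environment variable not set: %s" % missing)


def connect():
    # Imported lazily so the helper above also works without boto installed.
    import boto
    access_key, secret_key = load_credentials()
    return boto.connect_s3(access_key, secret_key)
```

With the variables exported, s3 = connect() gives the same kind of connection object as the line above, and the keys stay out of the script.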

4. try basic s3 functions

import os

from boto.s3.connection import S3Connection
from boto.s3.key import Key

#setup connection with AWS credentials
s3 = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) 

#create a bucket (you could think of it as a cloud folder on S3)
#Note: a bucket name has to be unique across all buckets on AWS (similar to a url),
#so you have to come up with a name that has not been taken by others.
b = s3.create_bucket('botobucket_1230')

#create a Key object, used to keep track of data stored in S3 (you could also think of it as a filename)
k = Key(b)
k.key = 'testKey'

#write content from a string to a key (a file in a bucket)
k.set_contents_from_string('this is a test string')

#check whether a key exists in a bucket: returns the key object if so, otherwise None
print b.get_key('testKey')
print b.get_key('testKey2')

#get all keys in the bucket (a separate loop variable, so k above is not overwritten)
for key in b.list():
    print key

#copy content from / to a local file
source = 'path to localfile.txt'
source_size = os.stat(source).st_size # file size in bytes
print source, source_size
k.set_contents_from_filename(source)
k.get_contents_to_filename('copy2local.txt')
Note: for large files, the FileChunkIO module can help chunk the original file into smaller segments (see the boto docs).
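
Here is a rough sketch of the chunked upload idea, following the multipart pattern from the boto docs. The function names and the 50 MB default chunk size are my own choices; keep in mind that S3 wants every part except the last to be at least 5 MB.

```python
import math
import os


def chunk_ranges(total_size, chunk_size):
    """Split total_size bytes into (offset, length) pairs of at most chunk_size."""
    count = int(math.ceil(total_size / float(chunk_size)))
    return [(i * chunk_size, min(chunk_size, total_size - i * chunk_size))
            for i in range(count)]


def upload_in_chunks(bucket, source, chunk_size=50 * 1024 * 1024):
    # Third-party module (pip install filechunkio), imported lazily.
    from filechunkio import FileChunkIO

    ranges = chunk_ranges(os.stat(source).st_size, chunk_size)
    mp = bucket.initiate_multipart_upload(os.path.basename(source))
    try:
        for part_num, (offset, length) in enumerate(ranges, start=1):
            # Each FileChunkIO object exposes one slice of the file.
            with FileChunkIO(source, 'r', offset=offset, bytes=length) as fp:
                mp.upload_part_from_file(fp, part_num=part_num)
        mp.complete_upload()
    except Exception:
        mp.cancel_upload()  # do not leave a half-finished upload behind
        raise
```

Usage would be upload_in_chunks(b, source) with the bucket b and the local path source from the example above.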

Tuesday, August 19, 2014

linear programming -- simplex method

I used to like Linear Programming at school, but I have barely used it for real-world problems. However, it could probably offer some help with RTB problems. Well, forget about RTB for now and let's start with an easier example. (I made up this example for easy calculation. Of course a real problem would be much more complicated.)

We have three factories (M1, M2, M3) that can make two types of products: A and B. M1 and M2 make product A; M1 and M3 make product B. Each A needs 2 hours at M1 and 4 hours at M2; each B needs 2 hours at M1 and 5 hours at M3. The profit is $2 for each product A and $3 for each product B. However, M1 can run at most 12 hrs a day, M2 at most 16 hrs, and M3 at most 15 hrs. To make as much money as we can, how many of product A and product B should we make each day?

factories   hr cost (product A)   hr cost (product B)   affordable hrs
M1          2                     2                     12
M2          4                     0                     16
M3          0                     5                     15

If we put it in a math way, with x1 = number of product A and x2 = number of product B per day:
objective:    max  2*x1 + 3*x2
subject to         2*x1 + 2*x2 <= 12
                   4*x1        <= 16
                          5*x2 <= 15
                   x1, x2 >= 0

Of course it can be transformed into a min optimization based on LP duality.
Here I want to show the solution by the simplex method. First, let's turn all the conditions into equations by adding three slack variables x3, x4, x5 (all >= 0)
             2*x1 + 2*x2 + x3 = 12
             4*x1        + x4 = 16
                    5*x2 + x5 = 15

then we put them in a matrix (the tableau) and do the pivots
pivot 1:
the purpose is to find the entering and departing variables that can improve the solution
- step 1: find the max in row z (the coefficient that contributes most to improving the solution). That is 3, so x2 is the entering variable (it becomes basic)
- step 2: in that column (where z=3), for positive x(i,j) find min{c(i)/x(i,j)}: 12/2 = 6 for row_x3 and 15/5 = 3 for row_x5. That is row_x5, so x5 is the departing variable (it becomes nonbasic)
detail of pivot 1:
 - row_x5 = row_x5/5
 - row_x3 = row_x3 - row_x5 * 2
 - row_z = row_z - row_x5*3
starting tableau (before pivot 1):
       x1    x2    x3    x4    x5     c
x3      2     2     1     0     0    12
x4      4     0     0     1     0    16
x5      0     5     0     0     1    15
z       2     3     0     0     0

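after pivot 1, x2 has moved to the left as a basic variable
in the same way, do pivot 2 on this tableau (x1 enters, since 2 is the max in row z; x3 departs, since 6/2 < 16/4):
       x1    x2    x3    x4    x5     c
x3      2     0     1     0  -2/5     6
x4      4     0     0     1     0    16
x2      0     1     0     0   1/5     3
z       2     0     0     0  -3/5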


after pivot 2 (final tableau):
       x1    x2    x3    x4    x5     c
x1      1     0   1/2     0  -1/5     3
x4      0     0    -2     1   4/5     4
x2      0     1     0     0   1/5     3
z       0     0    -1     0  -1/5


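now no coefficient in row z is positive, so no further pivot can improve the solution. x1 and x2 are both basic variables on the left, so the solution is: with x1 = 3 and x2 = 3 the objective reaches its max value, 2*3 + 3*3 = 15 (Note: not all LP problems have a feasible solution).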
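
To double check the hand calculation, here is a small sketch of the same tableau pivots in plain Python. It is my own toy version, not taken from any library: it only handles "maximize with <= constraints and non-negative right-hand sides", and it has no protection against cycling on degenerate problems.

```python
from fractions import Fraction


def simplex_max(c, A, b):
    """Maximize c.x subject to A.x <= b and x >= 0 (every b[i] must be >= 0)."""
    m, n = len(A), len(c)
    # Constraint rows: original coefficients, slack identity, right-hand side.
    rows = [[Fraction(v) for v in A[i]]
            + [Fraction(int(i == j)) for j in range(m)]
            + [Fraction(b[i])] for i in range(m)]
    # Row z: objective coefficients; the last cell holds minus the objective value.
    z = [Fraction(v) for v in c] + [Fraction(0)] * (m + 1)
    basis = [n + i for i in range(m)]  # the slack variables start as basic

    while True:
        # step 1: entering variable = largest coefficient in row z
        col = max(range(n + m), key=lambda j: z[j])
        if z[col] <= 0:
            break  # nothing left that can improve the objective
        # step 2: ratio test over rows with a positive entry in that column
        candidates = [i for i in range(m) if rows[i][col] > 0]
        if not candidates:
            raise ValueError("problem is unbounded")
        row = min(candidates, key=lambda i: rows[i][-1] / rows[i][col])
        # pivot: scale the departing row, then clear the column everywhere else
        pivot = rows[row][col]
        rows[row] = [v / pivot for v in rows[row]]
        for i in range(m):
            if i != row:
                f = rows[i][col]
                rows[i] = [v - f * p for v, p in zip(rows[i], rows[row])]
        f = z[col]
        z = [v - f * p for v, p in zip(z, rows[row])]
        basis[row] = col

    x = [Fraction(0)] * n
    for i, var in enumerate(basis):
        if var < n:
            x[var] = rows[i][-1]
    return x, -z[-1]


x, best = simplex_max([2, 3], [[2, 2], [4, 0], [0, 5]], [12, 16, 15])
print("x1 = %s, x2 = %s, max = %s" % (x[0], x[1], best))
```

Using Fraction keeps entries like -2/5 exact, so the intermediate rows can be compared directly with the tableaus above.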

Finally, to make life easier, Python has modules that implement the simplex method.
PyGLPK   http://tfinley.net/software/pyglpk/discussion.html
Here is an example of PyGLPK using simplex method  http://tfinley.net/software/pyglpk/ex_maxflow.html
Other resource: PuLP  http://www.coin-or.org/PuLP/

Tuesday, July 22, 2014

a taste of Python Requests


I barely use REST APIs with Python, but recently I found a great (and easy) Python library for HTTP/REST APIs: Python Requests.

Installing it takes almost no effort, with pip or easy_install ("easy_install requests" on my Mac OS)

First try with GET
 import requests  
 r = requests.get('https://example.com')  


GET with parameters
 import json
 param = {'user_id': 12345}
 r = requests.get('https://example.com', params=param)
 #to get the content of the response
 print r.text
 # it can also be parsed into json data
 data = json.loads(r.text)
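
To see what the parameters actually do to the URL, a request can be prepared without being sent. A small sketch, reusing the example.com placeholder and the user_id value from above:

```python
import requests

# Build the request without sending it, to inspect the final URL.
req = requests.Request('GET', 'https://example.com', params={'user_id': 12345})
prepared = req.prepare()
print(prepared.url)  # the query string now carries user_id=12345
```

For JSON responses, r.json() is a shortcut for json.loads(r.text).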


check with status code
 print r.status_code  

for a bad response (4xx or 5xx status code), raise an exception
 r.raise_for_status()
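
A sketch of how the exception could be caught. The helper name status_summary is made up, and the two Response objects are built by hand purely so the example runs without a network connection; in real code r would come from requests.get.

```python
import requests


def status_summary(response):
    """Return 'ok' for a good response, or the error text for a 4xx/5xx one."""
    try:
        response.raise_for_status()  # returns None on success, raises on 4xx/5xx
    except requests.exceptions.HTTPError as err:
        return "failed: %s" % err
    return "ok"


# Hand-built responses stand in for real ones, so no network is needed.
good, bad = requests.Response(), requests.Response()
good.status_code, bad.status_code = 200, 404
print(status_summary(good))
print(status_summary(bad))
```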


Session objects are pretty helpful
I used POST to log in; the session then carries the auth cookies from authentication through to later requests
 s = requests.Session()
 url = 'https://api.example.com'
 info = {'user': 'username', 'password': 'mypassword'}
 r = s.post(url, data=info)

I got an error at the very beginning
 requests.exceptions.SSLError: hostname 'api.example.com' doesn't match 
 either of 'example.com', 'www.example.com'

Requests verifies the host's SSL certificate for HTTPS requests by default, and here the certificate did not cover api.example.com. To get past it while testing, set the verify flag to False (this turns certificate checking off, so keep it to testing)
 r = s.post(url, data=info, verify=False)


ta-da!! it got through; then I can do GET, POST and DELETE to play with the data through the api
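
Why the session helps: any cookie stored in it is attached to every later request made through it. A sketch that shows this without touching the network; the cookie name, its value and the /data path are invented for illustration.

```python
import requests

s = requests.Session()
# Stand-in for the cookie that the login POST would store in the session.
s.cookies.set('auth', 'token-from-login')  # hypothetical name and value

# Prepare (but do not send) a later request made through the same session.
req = requests.Request('GET', 'https://api.example.com/data')  # hypothetical path
prepared = s.prepare_request(req)
print(prepared.headers.get('Cookie'))
```

With plain requests.get calls instead of a session, the cookie would have to be passed along by hand each time.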


useful resource: handy cheat sheet for beginners

Thursday, July 10, 2014

D3 "translate", easier way to assign position

"transform" + "translate" seems an easier way to assign components' positions in D3. It works the same way as (dx, dy). Here is an example of drawing circles.

Usually, components are located with specified dx, dy

A much simpler version can be done with "translate"


Both versions do the same job.