Web crawler with mechanize and Selenium
Mechanize is a simple crawling tool written in Python. It can be used with BeautifulSoup to crawl web pages and structure their HTML. However, a lot of content is generated dynamically by Ajax (JavaScript), which mechanize cannot handle. According to the official introduction of mechanize, it does not support JavaScript for now.
So we need another tool to fetch pages in which some elements are generated by JavaScript. Selenium is a good choice here. It is a web application testing framework that drives a real browser and can therefore run the JavaScript code.
In short, when we only need to fetch static content, mechanize is the best choice, and Selenium can be used to deal with dynamic pages.
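To illustrate why a static fetcher misses JavaScript-generated content, here is a minimal sketch that uses only Python's standard-library `html.parser` (not mechanize or Selenium). The `PAGE` string and the `visible_text` helper are made up for this example. A tool that only parses the downloaded HTML sees the empty `div`, because the script is never executed.

```python
from html.parser import HTMLParser

# Hypothetical page: the <div> is filled in by JavaScript at load time.
PAGE = """<html><body>
<h1>Static title</h1>
<div id="content"></div>
<script>document.getElementById("content").innerHTML = "Loaded by JS";</script>
</body></html>"""


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the source code inside <script> tags."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        text = data.strip()
        if text and not self.in_script:
            self.chunks.append(text)


def visible_text(html):
    """Return the text a static (non-JavaScript) parser can see."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chunks


static_view = visible_text(PAGE)
print(static_view)
```

The text "Loaded by JS" only exists after a browser runs the script, which is the part a browser-driving tool such as Selenium takes care of.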
When should I use lasso vs ridge?
from CrossValidated Q&A
When learning a model, we often have to estimate a large number of parameters and penalize some of them to avoid overfitting. Ridge and lasso are widely used methods for this. But how do we decide which one is suitable?
Keep in mind that ridge regression can't zero out coefficients; thus, you either end up including all the coefficients in the model, or none of them. In contrast, the LASSO does both parameter shrinkage and variable selection automatically. If some of your covariates are highly correlated, you may want to look at the Elastic Net [3] instead of the LASSO.
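The difference between shrinkage and selection can be seen in the special case of an orthonormal design (an assumption made here only to get closed forms; it is not part of the answer above). In that case ridge rescales each least-squares coefficient, while the lasso soft-thresholds it. The function names and the sample coefficients below are illustrative.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge solution for one coefficient when X^T X = I.

    Minimizes ||y - X b||^2 + lam * ||b||^2, which gives b = beta_ols / (1 + lam).
    """
    return beta_ols / (1.0 + lam)


def lasso_shrink(beta_ols, lam):
    """Lasso solution for one coefficient when X^T X = I (soft-thresholding).

    Minimizes 0.5 * ||y - X b||^2 + lam * ||b||_1.
    """
    magnitude = max(abs(beta_ols) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if beta_ols > 0 else -magnitude


ols = [3.0, 0.5, -0.2]  # made-up least-squares estimates
lam = 1.0
ridge = [ridge_shrink(b, lam) for b in ols]
lasso = [lasso_shrink(b, lam) for b in ols]
print("ridge:", ridge)
print("lasso:", lasso)
```

Every ridge coefficient stays non-zero however small it becomes, whereas the lasso sets the two small coefficients exactly to zero and so drops those variables from the model.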
I'd personally recommend using the Non-negative Garrote (NNG) [1] as it is consistent in terms of estimation and variable selection [2]. Unlike LASSO and ridge regression, NNG requires an initial estimate that is then shrunk towards the origin. In the original paper, Breiman recommends the least squares solution for the initial estimate (you may however want to start the search from a ridge regression solution and use something like GCV to select the penalty parameter).
In terms of available software, I've implemented the original NNG in MATLAB (based on Breiman's original FORTRAN code). You can download it from:
http://www.emakalic.org/blog/wp-content/uploads/2010/04/nngarotte.zip
BTW, if you prefer a Bayesian solution, check out [4,5].
References:
[1] Breiman, L. Better Subset Regression Using the Nonnegative Garrote. Technometrics, 1995, 37, 373-384.
[2] Yuan, M. & Lin, Y. On the non-negative garrotte estimator. Journal of the Royal Statistical Society (Series B), 2007, 69, 143-161.
[3] Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society (Series B), 2005, 67, 301-320.
[4] Park, T. & Casella, G. The Bayesian Lasso. Journal of the American Statistical Association, 2008, 103, 681-686.
[5] Kyung, M.; Gill, J.; Ghosh, M. & Casella, G. Penalized Regression, Standard Errors, and Bayesian Lassos. Bayesian Analysis, 2010, 5, 369-412.
Matrix factorization (MF) has been widely used in rating prediction. To predict ratings, the baseline model minimizes the following regularised objective:
$$\min_{p,q}\sum_{u,i}{(r_{u,i}-p_u^{T}q_i)^2} + \lambda(\sum_{u}{||p_u||^2}+\sum_{i}{||q_i||^2})$$
where $r_{u,i}$ is the observed rating of item $i$ by user $u$, $p_u$ and $q_i$ are the user and item factor vectors, and $\lambda$ is the weight of the regularisation part of the objective function, which helps avoid overfitting.
For the SGD,...
For the ALS, we ...
Stochastic gradient descent (SGD) is often chosen for its speed and ease of implementation. ALS can be parallelised and handles non-sparse datasets faster than SGD.
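As a rough illustration of the SGD variant, here is a minimal pure-Python sketch that minimizes the regularised squared error between each rating $r_{u,i}$ and the dot product of $p_u$ and $q_i$. The toy ratings, the hyperparameters (`k`, `lr`, `lam`, `epochs`) and the function names are assumptions chosen for this example, not values from any particular paper.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rmse(ratings, P, Q):
    """Root mean squared error over the observed (user, item, rating) triples."""
    total = sum((r - dot(P[u], Q[i])) ** 2 for u, i, r in ratings)
    return math.sqrt(total / len(ratings))


def sgd_mf(ratings, n_users, n_items, k=2, lr=0.01, lam=0.02, epochs=1000, seed=0):
    """Fit user factors P and item factors Q by SGD; also return the initial RMSE."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    initial = rmse(ratings, P, Q)
    order = list(ratings)  # copy so the caller's list is not shuffled
    for _ in range(epochs):
        rng.shuffle(order)
        for u, i, r in order:
            err = r - dot(P[u], Q[i])
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on (err^2 + lam * (pu^2 + qi^2)); the factor 2 is folded into lr.
                P[u][f] += lr * (err * qi - lam * pu)
                Q[i][f] += lr * (err * pu - lam * qi)
    return P, Q, initial


# Toy data: (user, item, rating)
RATINGS = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P, Q, rmse_before = sgd_mf(RATINGS, n_users=3, n_items=3)
rmse_after = rmse(RATINGS, P, Q)
print(rmse_before, rmse_after)
```

Each update touches only one user vector and one item vector, which is why SGD is cheap per step on sparse rating data.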
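Thanks to @卿卿Antony for compiling these Chinese sentiment analysis corpora:
- COAE: http://t.cn/hDhkkH
- SONGBO TAN: [http://t.cn/aWEZ0Z]. NTCIR also has one, but it seems to require authorization.
- There is also the one from HowNet (知网).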
Everything you love about GitHub, on your own server!
[shell] Kill a process by name with kill
Get the PID from the process name, then kill it:
#kill -9 $(ps -ef|grep process_name_keyword|gawk '$0 !~/grep/ {print $2}' |tr -s '\n' ' ')
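The same idea can be sketched in Python: parse `ps -ef` output, keep the lines that contain the keyword but not `grep`, and take the second column as the PID. The helper names `pids_matching` and `kill_by_name` and the sample `ps` output are hypothetical, and the sketch assumes a POSIX system.

```python
import os
import signal
import subprocess


def pids_matching(ps_output, keyword):
    """Return PIDs of `ps -ef` lines containing keyword, ignoring grep itself."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[1].isdigit():
            continue  # skips the header line and malformed lines
        if keyword in line and "grep" not in line:
            pids.append(int(fields[1]))
    return pids


def kill_by_name(keyword, sig=signal.SIGKILL):
    """Send sig (default SIGKILL, like kill -9) to every matching process."""
    ps_output = subprocess.run(
        ["ps", "-ef"], capture_output=True, text=True, check=True
    ).stdout
    for pid in pids_matching(ps_output, keyword):
        if pid != os.getpid():  # never kill this script itself
            os.kill(pid, sig)


# Made-up sample of `ps -ef` output to show the parsing step only.
SAMPLE = """UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 10:00 ?        00:00:01 /sbin/init
alice     4242     1  0 10:05 pts/0    00:00:10 python crawler.py
alice     4250  4100  0 10:06 pts/0    00:00:00 grep crawler
"""
demo_pids = pids_matching(SAMPLE, "crawler")
print(demo_pids)
```

As with the shell one-liner, matching by substring is blunt: any process whose command line happens to contain the keyword will be killed, so pick a keyword that is specific enough.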
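Python logging module: mongodb-log
mongodb-log is a Python logging module that stores log messages in MongoDB.
1 Very lightweight;
2 Provides a web interface for convenient remote browsing (built with the web.py framework).
If you happen to use Python, happen to need a logging system, also happen to be interested in MongoDB, and on top of that need a simple web interface, then this module solves all of the above perfectly.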
Sharing a few NLP tools
Language Detection Library for Java
- Generate language profiles from Wikipedia abstract xml
- Detect language of a text using naive Bayesian filter
- Over 99% precision for 53 languages
Google Ngram Viewer
Here are the datasets backing the Google Books Ngram Viewer. These datasets were generated in July 2009; we will update these datasets as our book scanning continues, and the updated versions will have distinct and persistent version identifiers (20090715 for the current set).
Boilerplate Removal and Fulltext Extraction from HTML pages
The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
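A shell script that periodically checks whether a program is running
A while ago I wrote a Python script to fetch data from an API. A program like this has to run for months, but after one day a problem appeared: "the program exited by itself because of a socket connection problem". I looked up some articles online, but solutions were scarce and none of the ones I tried worked. I have also been busy lately, so I bit the bullet and wrote a shell script that automatically checks whether the program is running normally and restarts it if it has exited.
The basic principle is simple: use the ps command to check whether a given process exists. Here is the basic code: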
    check_process(){
        # check the args: without a process name there is nothing to look for
        if [ "$1" = "" ];
        then
            return 0
        fi

        # PROCESS_NUM => number of running processes matching the given name
        PROCESS_NUM=$(ps -ef | grep "$1" | grep -v "grep" | wc -l)
        if [ "$PROCESS_NUM" -ge 1 ];
        then
            return 1
        else
            return 0
        fi
    }

    # check whether an instance of the process exists
    while [ 1 ] ; do
        echo 'begin checking...'
        check_process "test" # the process name
        CHECK_RET=$?   # no spaces around "=" in a shell assignment
        if [ $CHECK_RET -eq 0 ]; # none exists
        then
            # do something, e.g. restart the program
            :
        fi
        sleep 60
    done
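For comparison, here is a hedged Python sketch of a related approach. Instead of polling `ps`, the supervisor launches the program as a child process and restarts it whenever it exits with a non-zero status. The function name `keep_alive`, its parameters, and the choice to stop after a clean exit are assumptions made for this example.

```python
import subprocess
import sys
import time


def keep_alive(cmd, max_restarts=3, delay=60.0):
    """Run cmd and restart it after each failure; return the list of exit codes.

    Stops early when the program exits cleanly (exit code 0).
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        exit_codes.append(subprocess.call(cmd))  # blocks until the child exits
        if exit_codes[-1] == 0:
            break
        if attempt < max_restarts:
            time.sleep(delay)  # wait before the next restart
    return exit_codes


# Demo with a child that always fails immediately (exit code 3).
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
demo_codes = keep_alive(failing, max_restarts=2, delay=0.0)
print(demo_codes)
```

Because the supervisor owns the child process, it notices the exit immediately and does not depend on matching process names, at the cost of having to keep the supervisor itself running.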
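Running Linux programs in the background
Very often our programs need to run for a long time (SVM, crawler, downloader, etc.). If you start a program in a terminal and then close that terminal, the program is terminated too. Apart from the screen command (which requires the screen tool, and some servers do not provide it), the only option is to move the program to the background.
On Unix/Linux, if you want a program to run independently of the terminal, you usually append & to the end of the command so that it runs in the background. (The space after the command is optional.)
However, if the program prints debug output, you can press Enter to get back to the prompt, but if the program keeps calling printf, the terminal stays occupied and you cannot type any other command.
Also note that the program keeps running after you close the terminal only if the $ or # prompt has appeared; otherwise it is terminated as well.
The nohup command writes the debug output to the file nohup.out by default. However, you still cannot do anything else in that terminal.
To solve this, combine it with the first method and use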
$ nohup command > log &
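The command above also redirects the debug output to the file log and appends & at the end. This way, once the $ or # prompt appears, you can do whatever you like.
Moreover, when the program fails, it goes without saying how precious the error messages are to us developers. The command above redirects only standard output to the log file, so the stderr output is lost. To fix this, use the following command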
$ nohup command > log 2>&1 &
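In the shell, the standard file descriptors are STDIN, STDOUT, and STDERR, i.e. 0, 1, and 2. The command above redirects STDERR to STDOUT, so the error output also ends up in log.
Finally, my personal feeling is that screen is still the simplest, one-step solution for running programs in the background.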
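The same redirection can be sketched from Python with the standard `subprocess` module: `stdout` is pointed at the log file and `stderr=subprocess.STDOUT` plays the role of `2>&1`. The helper name `run_logged` and the demo child command are made up for this example; `start_new_session=True` (POSIX only) detaches the child from the terminal's session, which roughly corresponds to what nohup is used for here.

```python
import os
import subprocess
import sys
import tempfile


def run_logged(cmd, log_path):
    """Start cmd in the background with stdout and stderr both written to log_path."""
    with open(log_path, "w") as log:
        return subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log,                # like "> log"
            stderr=subprocess.STDOUT,  # like "2>&1"
            start_new_session=True,    # detach from the controlling terminal's session
        )


# Demo: a child that writes one line to each stream.
log_path = os.path.join(tempfile.mkdtemp(), "log")
child = [sys.executable, "-c",
         "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"]
proc = run_logged(child, log_path)
exit_code = proc.wait()  # a real long-running job would not be waited for
with open(log_path) as f:
    log_text = f.read()
print(log_text)
```

Both lines end up in the same file, although their order may differ from the order in which they were printed because the two streams are buffered differently.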
Text Similarity, NLP Seminar at Leeds, Feb. 2010 (from Sharaf)
Before data mining itself, data preprocessing plays a crucial role. One of the first steps concerns the normalization of the data. This step is very important when dealing with parameters of different units and scales. For example, some data mining techniques use the Euclidean distance. Therefore, all parameters should have the same scale for a fair comparison between them.
Two methods are well known for rescaling data. The first is normalization, which scales all numeric variables to the range [0,1]:
$$x_{new}=\frac{x-x_{min}}{x_{max}-x_{min}}$$
Alternatively, you can standardize your data set. This transforms it to have zero mean and unit variance, using the equation below:
$$x_{new} = \frac{x-\mu}{\sigma}$$
Both of these techniques have their drawbacks. If you have outliers in your data set, normalizing your data will squeeze the “normal” data into a very small interval, and most data sets do have outliers. When using standardization, you assume that your data have been generated by a Gaussian law (with a certain mean and standard deviation). This may not be the case in reality.
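A small sketch of both rescaling formulas, using only the standard library. The function names and the sample values are made up for illustration, and the population standard deviation is used for $\sigma$ (an assumption, since the text does not say which estimator is meant).

```python
import statistics


def min_max(values):
    """Normalization: x_new = (x - x_min) / (x_max - x_min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: x_new = (x - mu) / sigma, with the population sigma."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("zero variance; standardization is undefined")
    return [(v - mu) / sigma for v in values]


clean = [1.0, 2.0, 3.0]
with_outlier = [1.0, 2.0, 3.0, 1000.0]

scaled_clean = min_max(clean)
scaled_outlier = min_max(with_outlier)
z_scores = standardize(clean)

print(scaled_clean)
print(scaled_outlier)
print(z_scores)
```

With the single outlier 1000.0 added, the three ordinary values are all mapped below 0.01, which is the drawback of normalization described above.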