soup rocks

Quite inspired by my sister writing articles on blogger.com, I too have joined the race and will try to write some interesting ones here. So here I go...

Sunday, June 18, 2006

MINING RESEARCH PAPERS: making of an interface to use PATMAP for mining research papers

Research papers have been an important mode of research transfer.
In recent years, there has been substantial public and private interest in the concept of technology transfer, especially, but not exclusively, at universities. This is important to inventors, researchers and small entrepreneurs looking to develop innovative technology, as well as technology firms striving to create new innovations, manufacturers conducting research and development (R&D) to generate new products, investors looking for new growth companies, and government officials seeking to find ways to spur and support economic development.

There are many search engines currently providing databases of various publications, e.g. Google Scholar (www.scholar.google.com), ProQuest (www.proquest.com), etc. We have concentrated our work on Google Scholar. Google Scholar enables specific searches of scholarly literature, including peer-reviewed papers, theses, books, pre-prints, abstracts, and technical reports. Content comes from a range of publishers and aggregators with whom Google already has standing arrangements, e.g., the Association for Computing Machinery, IEEE, OCLC’s Open WorldCat library locator service, etc. Result displays show different version clusters, citation analysis, and library location (currently books only). Although it claims coverage “from all broad areas of research,” early evaluation seems to show a clear emphasis on science and technology rather than the arts, humanities, or social sciences. Google does not specify the number of journals or publications it has included.

Searching these databases has been a major problem because:
1. Search engines return thousands of research papers, among which it is difficult to find the ones relevant to you.
2. Searches are performed on the basis of keywords entered by the user, or a Boolean combination of them (AND, OR, NOT, etc.), rather than on the content of the paper.

Advanced search may allow the user to specify keywords in particular sections, namely title, abstract, author, publisher, etc. Searches can also be restricted by date.

The mining process aims at extracting the technology development path and a technology map from the database of research papers. Text mining based on co-citation, co-classification, or co-word analysis can reveal relevant links. Text mining is semi-automatic in nature and requires human intervention. Data/text mining is defined as the non-trivial, semi-automatic extraction of implicit, previously unknown, and potentially useful information from data.
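To give a feel for what co-word analysis means, here is a minimal sketch in Python. It is not the PATMAP code; the function name `co_word_counts`, the sample abstracts, and the small vocabulary are all made up for illustration. It simply counts how often two terms of interest appear together in the same document, which is the raw material from which links between papers can be drawn.

```python
from collections import Counter
from itertools import combinations


def co_word_counts(documents, vocabulary):
    """Count how often each pair of vocabulary terms occurs in the same document."""
    counts = Counter()
    for text in documents:
        words = set(text.lower().split())
        # Keep only the terms we care about, sorted so each pair has one fixed order.
        present = sorted(words & vocabulary)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Hypothetical abstracts and vocabulary, for illustration only.
abstracts = [
    "text mining of patent databases",
    "patent mapping with text mining",
    "clustering research papers by topic",
]
terms = {"text", "mining", "patent", "clustering"}
pairs = co_word_counts(abstracts, terms)

for pair, count in pairs.most_common():
    print(pair, count)
```

Pairs with a high count suggest that the two terms, and the papers containing them, belong to the same line of work. A real system would also need stop-word removal and stemming, which are left out here.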

We have focused on reducing the irrelevant search results a user obtains after entering a set of keywords. Currently, thousands of results are displayed. We aim to create an intermediate platform that lets the user narrow the search to the areas of interest.

PATMAP: a medium to find your patents in the USPTO database
For details, refer to the article on text mining in Directions. The same concept has been used to map research papers. An interface has been created to make the search more accurate and to enable the software to take in research papers instead of patents.


Vesutek: a medium to get your research paper

We have attempted to work on the results of Google Scholar and to reduce them to the research papers the user specifies.

The problems faced and their solutions:
• Papers may be present in different formats, such as text, HTML, and PDF, and not all of these can be parsed directly. So all documents are reduced to HTML, and an attempt is made to parse them through their hyperlinks without storing all the data.
• A semi-automatic process may prove very tedious. If the user wishes, it can be turned into an automatic process through clustering, though with some loss of precision. The top ten or twenty results returned by Google can be entered into the database as the relevant links. The pool then consists of both the results considered relevant and the remaining results, and these are mined. The process can be repeated as many times as the user requests, which makes the user's work easier.
• A research paper lying in one class should not lie in any other class. These classes can be treated as topics. The user can choose a topic for the search and thus reduce the results substantially.
• The classification step may simply be eliminated when the user interface is automated.
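The idea of non-overlapping classes in the list above can be sketched roughly as follows. This is only an illustration under my own assumptions: the topic names, their keyword sets, and the functions `assign_topic` and `group_by_topic` are hypothetical, and simple keyword overlap stands in for whatever classifier is really used. The point is that each paper ends up in exactly one class, so choosing a topic cuts the result list down cleanly.

```python
def assign_topic(text, topics):
    """Return the single topic whose keywords overlap most with the text, or None."""
    words = set(text.lower().split())
    best_topic, best_score = None, 0
    for name, keywords in topics.items():
        score = len(words & keywords)
        # Strict '>' means the first topic wins a tie, so a paper never gets two classes.
        if score > best_score:
            best_topic, best_score = name, score
    return best_topic


def group_by_topic(papers, topics):
    """Place every paper title in exactly one group (None holds unmatched papers)."""
    groups = {name: [] for name in topics}
    groups[None] = []
    for title in papers:
        groups[assign_topic(title, topics)].append(title)
    return groups


# Hypothetical topics and paper titles, for illustration only.
topics = {
    "text mining": {"text", "mining", "patent"},
    "clustering": {"clustering", "cluster", "k-means"},
}
papers = [
    "Text mining of patent data",
    "Clustering documents with k-means",
    "A history of soup",
]
groups = group_by_topic(papers, topics)
print(groups["text mining"])
```

A user interested only in clustering would then look at `groups["clustering"]` and never see the other papers.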



Work done:
1. A link grabber has been built to capture the hyperlinks returned by Google. The mining is proposed to be done without downloading the HTML pages.
2. HTML documents can now be entered into the database for one specific publication (Synergy Blackwell).
3. The previous work on mining patents has been reviewed, and an attempt has been made to apply the concept to research papers.
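A link grabber of the kind mentioned in item 1 can be sketched with Python's standard `html.parser` module. This is not the code written for the project; the class name `LinkGrabber`, the helper `grab_links`, and the sample page with its example.org addresses are assumptions made for illustration. It works on an HTML string that has already been fetched and collects the `href` of every link in it.

```python
from html.parser import HTMLParser


class LinkGrabber(HTMLParser):
    """Collect the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Anchors without an href (for example <a name="top">) are skipped.
            if name == "href" and value:
                self.links.append(value)


def grab_links(html_text):
    """Return all hyperlinks found in the given HTML text, in page order."""
    parser = LinkGrabber()
    parser.feed(html_text)
    parser.close()
    return parser.links


# A made-up results page, for illustration only.
sample_page = """
<html><body>
  <a href="http://example.org/paper1.html">Paper one</a>
  <a name="top">No link here</a>
  <a href="http://example.org/paper2.html">Paper two</a>
</body></html>
"""
links = grab_links(sample_page)
print(links)
```

Only the list of links is kept, not the pages behind them, which is in the spirit of mining without storing all the data.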

Work proposed:

1. To make the process quicker and reduce its complexity (in terms of time and memory), an algorithm is to be written that picks the top ten results of Google Scholar, or of any other search engine for that matter, and iterates treating those results as relevant. The number of iterations can be specified by the user.
2. The work would have higher precision with discriminant analysis than with vectors alone to represent documents (reference: Vinod's thesis).
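The iteration proposed in item 1 might look roughly like the sketch below. Please read it as an assumption-laden illustration, not as the planned algorithm: it uses plain word-count vectors and cosine similarity (the simpler of the two approaches in item 2, not discriminant analysis), and the names `vectorize`, `cosine`, and `rerank` as well as the sample results are hypothetical. In each round the current top results are treated as relevant, merged into one profile, and every result is re-ranked by its similarity to that profile.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a document as a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(results, top_k=10, iterations=1):
    """Re-rank results, treating the current top_k as relevant in each iteration.

    `results` is a list of texts in the search engine's original order.
    Returns the indices of the results in their new order.
    """
    vectors = [vectorize(text) for text in results]
    order = list(range(len(results)))
    for _ in range(iterations):
        # Build a profile of the "relevant" set from the current top_k results.
        profile = Counter()
        for i in order[:top_k]:
            profile.update(vectors[i])
        # sorted() is stable, so equally similar results keep their previous order.
        order = sorted(order, key=lambda i: cosine(vectors[i], profile), reverse=True)
    return order


# Made-up search results, for illustration only.
results = [
    "text mining of patents",
    "cooking soup at home",
    "mining patents with text analysis",
    "soup recipes",
]
new_order = rerank(results, top_k=1, iterations=1)
print([results[i] for i in new_order])
```

Here the user-specified number of iterations is the `iterations` argument. A larger `top_k` trusts more of the engine's own ranking, which saves the user effort but can let irrelevant results into the profile.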

Scope for future work:

• Working on ProQuest could be much more beneficial than working on Google Scholar with the same methodology. It would enable the user to scan the entire database, as ProQuest provides all research papers in a single format.
• Although an attempt has been made to make the software user-friendly, that area still has scope for further work using clustering.

……………x……………………x…………………………x……………

Special thanks to Dr. Veena Bansal for giving me an opportunity to explore the world of data mining. I would also like to thank Sunita Shukla for her support and for doing all the coding, without which nothing would have been possible, and Vinod Singh Rathore for sharing his experience in this field with me.
It was a pleasure to work at the IME department, IITK.

References
1. Thesis by Bhuvan Mirdha, IME department, IITK.
2. Thesis by Vinod Singh Rathore, IME department, IITK.
3. An article on data mining by Dr. Veena Bansal and Dr. A.K. Mittal in Directions, volume 7, no. 3.

If you have had trouble understanding PATMAP... wait for my next post.
