This research discusses a PDF text extraction tool based on PDF-Renderer, emphasizing the importance of PDF as a secure file format for document exchange. It outlines the challenges of current text extractor tools, reviews notable examples, and demonstrates how PDF-Renderer can provide an alternative solution by addressing the limitations found in existing tools.
International Journal of New Computer Architectures and Their Applications, 2011
1995
A strategy for document analysis is presented which uses Portable Document Format (PDF, the underlying file structure for Adobe Acrobat software) as its starting point. This strategy examines the appearance and geometric position of text and image blocks distributed over an entire document. A blackboard system is used to tag the blocks as a first stage in deducing the fundamental relationships existing between them. PDF is shown to be a useful intermediate stage in the bottom-up analysis of document structure. Its information on line spacing and font usage gives important clues in bridging the 'semantic gap' between the scanned bitmap page and its fully analysed, block-structured form. Analysis of PDF can yield not only accurate page decomposition but also sufficient document information for the later stages of structural analysis and document understanding.
Document Analysis …, 2006
Journal of Advanced Management Science, 2016
Born-digital PDF electronic documents might reasonably be expected to preserve useful data units of their source originals that suffice to produce executable papers for reproducible research. Unfortunately, developers of authoring tools may adopt arbitrary PDF generation strategies, producing a plethora of internal data representations. Common information units such as text paragraphs, tables, function graphs and flow diagrams may require numerous heuristics to handle each vendor-specific PDF file content properly. We propose a generic Reverse MVC interpretation pattern that makes it possible to cope with that arbitrariness in a systematic way. It constitutes a component of a larger framework we have been developing for making executable papers out of PDF documents without injecting any extra data or code into the PDF file.
A survey of current practice shows that universities (e.g., Pune University) publish the result gazette for engineering courses in PDF format. The PDF contains details such as seat number, centre, permanent registration number (PRN), name, subjects, and marks. At present, the PDF is converted to Excel format in order to produce the various reports required at department, college, and university level, which involves a partly manual process. This approach has several limitations: it is only semi-automated, it has no GUI, SMS and e-mail gateways are not supported, and, most importantly, graphical analysis of the data is not available. The existing applications we surveyed are either semi-automated or automated with restrictions that prevent full automation of result analysis in a proper format; none of them supports full automation. To overcome these drawbacks, we propose a new, automated system for result analysis. Its features include automatic output generation in formats such as Excel, PDF, and MySQL (selected by the user, for compatibility with other ERP systems), an active SMS gateway, an active e-mail gateway, an interactive and user-friendly GUI, and graphical result analysis with text. The proposed system targets the limitations listed above to provide an effective solution for result analysis. It will also work with the current grade system, maintaining a database of students that shows each student's complete status. The automated solutions provided by the system will make examination department activities more efficient by addressing the main drawbacks of the manual system, namely speed, precision, and simplicity. It will also work as a generalized system that supports any type and format of PDF file.
A centralized system will ensure that the activities in the context of an examination can be managed effectively, while also making it more accessible and convenient for both staff and students.
Technologies, 2019
In the age of digitalization, the collection and analysis of large amounts of data is becoming increasingly important for enterprises to improve their businesses and processes, such as the introduction of new services or the realization of resource-efficient production. Enterprises concentrate strongly on the integration, analysis and processing of their data. Unfortunately, the majority of data analysis focuses on structured and semi-structured data, although unstructured data such as text documents or images account for the largest share of all available enterprise data. One reason for this is that most of this data is not machine-readable and requires dedicated analysis methods, such as natural language processing for analyzing textual documents or object recognition for recognizing objects in images. Especially in the latter case, the analysis methods depend strongly on the application. However, there are also data formats, such as PDF documents, which are not machine-readable a...
… Image Analysis for …, 2005
J. Univers. Comput. Sci., 2012
Text preprocessing and segmentation are critical tasks in search and text mining applications. Because a huge number of documents are presented exclusively in PDF format, most Data Mining (DM) and Information Retrieval (IR) systems must extract content from PDF files. On some occasions this is a difficult task: the result of the extraction process from a PDF file is plain text, and it should be returned in the same order as a human would read the original PDF file. However, current tools for PDF text extraction fail in this objective when working with complex documents with multiple columns. For instance, this is the case for official government bulletins with legal information. In this task, it is mandatory to get correct and ordered text as a result of applying the PDF extractor. It is very common for a legal article in a document to refer to a previous article, so they should be presented in the right sequential order. To overcome these difficulties we h...
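The reading-order problem described in this abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes an extractor has already produced text blocks with bounding boxes, and that the page has a fixed number of equal-width columns. The `TextBlock` type, the `reading_order` function, and the sample coordinates are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """Hypothetical text block as an extractor might report it."""
    text: str
    x0: float   # left edge
    top: float  # distance from the top of the page
    x1: float   # right edge


def reading_order(blocks, page_width, n_columns=2):
    """Sort blocks column by column, top to bottom within each column.

    Assumes equal-width columns; full-width headings are not handled.
    """
    column_width = page_width / n_columns

    def key(block):
        center = (block.x0 + block.x1) / 2
        # Clamp so a block touching the right margin stays in the last column.
        column = min(int(center // column_width), n_columns - 1)
        return (column, block.top, block.x0)

    return sorted(blocks, key=key)


# Blocks in the interleaved order a naive top-to-bottom extractor might emit.
blocks = [
    TextBlock("Article 2 (right column, top)", 320, 50, 580),
    TextBlock("Article 1 (left column, top)", 20, 50, 280),
    TextBlock("Article 1 continued", 20, 120, 280),
    TextBlock("Article 3", 320, 120, 580),
]
ordered = [b.text for b in reading_order(blocks, page_width=600)]
```

Sorting on the column index first keeps the whole left column ahead of the right one, which is the order a human reader would follow. Real documents would need column detection and handling of headings that span columns, which this sketch leaves out.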
International Journal of Applied Engineering Research
Electronic Publishing - Origination, Dissemination and Design, 1995
International Journal on Artificial Intelligence Tools, 2009
2008 The Eighth IAPR International Workshop on Document Analysis Systems, 2008
International Journal of Knowledge and Web Intelligence, 2011