This paper addresses the challenges of table extraction from Portable Document Format (PDF) files, particularly focusing on the difficulties posed by various PDF types such as True PDFs, Scanned PDFs, and Searchable PDFs. It proposes an automated solution that facilitates the conversion of searchable PDFs into XML format and extracts relevant data for storage in a NoSQL database. This approach streamlines the data processing workflow for educational institutions, significantly reducing manual effort while enabling efficient results analysis and insights.
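The conversion step described above can be illustrated with a minimal sketch. It assumes a hypothetical XML layout in which every word is a `<text>` element carrying `top` and `left` page coordinates; the element name, attribute names, and sample values are illustrative assumptions, not the format produced by the paper's tool. Words sharing a `top` value are grouped into one table row, and each row becomes a dictionary of the kind a document-oriented NoSQL store would accept.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical XML layout (an assumption): one <text> element per word,
# positioned by "top" and "left" page coordinates.
SAMPLE_XML = """
<page>
  <text top="100" left="50">SeatNo</text>
  <text top="100" left="150">Name</text>
  <text top="100" left="300">Marks</text>
  <text top="120" left="50">T1001</text>
  <text top="120" left="150">Asha</text>
  <text top="120" left="300">78</text>
  <text top="140" left="50">T1002</text>
  <text top="140" left="150">Ravi</text>
  <text top="140" left="300">64</text>
</page>
"""


def xml_to_records(xml_text):
    """Group words by vertical position into rows, then map rows to dicts."""
    root = ET.fromstring(xml_text.strip())
    rows = {}
    for node in root.iter("text"):
        top = int(node.get("top"))
        left = int(node.get("left"))
        rows.setdefault(top, []).append((left, node.text.strip()))

    # Sort rows top to bottom, and cells within each row left to right.
    ordered = [
        [cell for _, cell in sorted(cells)]
        for _, cells in sorted(rows.items())
    ]
    header, *body = ordered
    return [dict(zip(header, row)) for row in body]


records = xml_to_records(SAMPLE_XML)
# Each record is a JSON-serializable document ready for a document store.
print(json.dumps(records, indent=2))
```

Real PDF-to-XML output rarely aligns rows on exactly the same coordinate, so a practical version would likely need a tolerance when grouping words into rows.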
A recent survey shows that the results declared by universities (e.g., Pune University) for engineering are published as PDF files. The PDF contains details such as seat number, centre, permanent registration number (PRN), name, subjects, and marks. At present, the PDF is converted to Excel format in order to produce the various reports required by the department, college, and university at different levels, which involves a partly manual process. These operations have several limitations: the process is only semi-automated, there is no GUI, SMS and e-mail gateways are not supported, and, above all, graphical analysis of the data is not available. In the course of the survey, we came across existing applications that are semi-automated, or automated with restrictions that prevent full automation of result analysis in a proper format. None of these applications supports full automation. To overcome these drawbacks, we propose a new system for result analysis that is automated and offers features such as automatic output generation in different formats (Excel, PDF, MySQL) for compatibility with other ERP systems as per user selection, an active SMS gateway, an active e-mail gateway, an interactive and user-friendly GUI, and graphical result analysis with text. The proposed system targets these limitations to provide an effective solution for result analysis. It will also work with the current grade system, maintaining a database of students that shows each student's complete status. The automated solutions provided by the system will make examination department activities more efficient by addressing the most important drawbacks of the manual system, namely speed, precision, and simplicity. It is also intended to work as a generalized system that supports any type and format of PDF file.
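As a rough illustration of the kind of automated result analysis described here, the sketch below parses text lines into student records and computes a small summary. The line layout (seat number, PRN, name, then one mark per subject), the sample values, and the pass mark of 40 are all assumptions made for the example; they are not taken from an actual university result file or from the proposed system.

```python
import re
from statistics import mean

# Assumed line layout: seat number, PRN, upper-case name, then one mark per subject.
LINE_RE = re.compile(
    r"^(?P<seat>\S+)\s+(?P<prn>\S+)\s+(?P<name>[A-Z ]+?)\s+(?P<marks>\d+(?:\s+\d+)*)$"
)

# Made-up sample lines for illustration only.
SAMPLE_LINES = [
    "T1001 PRN001 ASHA PATIL 78 64 55",
    "T1002 PRN002 RAVI KULKARNI 35 72 40",
]

PASS_MARK = 40  # assumed per-subject pass threshold


def parse_line(line):
    """Return a student record, or None if the line does not match the layout."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "seat_no": match["seat"],
        "prn": match["prn"],
        "name": match["name"],
        "marks": [int(value) for value in match["marks"].split()],
    }


def summarize(records):
    """Count students who passed every subject and average the total marks."""
    passed = [r for r in records if min(r["marks"]) >= PASS_MARK]
    return {
        "students": len(records),
        "passed": len(passed),
        "average_total": mean(sum(r["marks"]) for r in records),
    }


parsed = [parse_line(line) for line in SAMPLE_LINES]
students = [record for record in parsed if record is not None]
summary = summarize(students)
print(summary)
```

Records in this shape could then be exported to Excel or MySQL, or fed to a charting library for the graphical analysis the abstract mentions.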
A centralized system will ensure that the activities in the context of an examination can be managed effectively, while also making it more accessible and convenient for both staff and students.
Encyclopedia of Information Science and Technology, Third Edition, 2015
1995
A strategy for document analysis is presented which uses Portable Document Format (PDF, the underlying file structure for Adobe Acrobat software) as its starting point. This strategy examines the appearance and geometric position of text and image blocks distributed over an entire document. A blackboard system is used to tag the blocks as a first stage in deducing the fundamental relationships existing between them. PDF is shown to be a useful intermediate stage in the bottom-up analysis of document structure. Its information on line spacing and font usage gives important clues in bridging the 'semantic gap' between the scanned bitmap page and its fully analysed, block-structured form. Analysis of PDF can yield not only accurate page decomposition but also sufficient document information for the later stages of structural analysis and document understanding.
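To make the idea of tagging blocks from font usage concrete, here is a minimal rule-based sketch. It is not the blackboard system the abstract describes; it is a much simpler heuristic that treats the most common font size as body text and tags larger or smaller blocks relative to it. The `Block` class, the tag names, and the sample values are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    font_size: float
    top: float  # distance from the top of the page, in arbitrary units


def tag_blocks(blocks):
    """Tag each block by comparing its font size with the dominant (body) size."""
    body_size = Counter(b.font_size for b in blocks).most_common(1)[0][0]
    tagged = []
    # Process blocks in reading order, top of the page first.
    for block in sorted(blocks, key=lambda b: b.top):
        if block.font_size > body_size:
            tag = "heading"
        elif block.font_size < body_size:
            tag = "small-print"
        else:
            tag = "body"
        tagged.append((tag, block.text))
    return tagged


# Made-up page content for illustration only.
page = [
    Block("First paragraph of body text.", 10.0, 120.0),
    Block("Document Title", 18.0, 40.0),
    Block("Second paragraph of body text.", 10.0, 200.0),
    Block("1. A footnote.", 8.0, 700.0),
]

tags = tag_blocks(page)
for tag, text in tags:
    print(f"{tag:12} {text}")
```

A fuller analysis would also use line spacing and block geometry, as the abstract notes, rather than font size alone.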
The 15th International Scientific Conference eLearning and Software for Education, 2019
We live in the century of technology, where the enormous growth of data and science has driven a strong interest in processing, transmitting, and storing information. In the past, only a human mind could extract meaningful information from image data; after decades of dedicated research, scientists have built complex systems that can identify different areas, tables, and texts in scanned documents, so that the extracted information can be easily accessed and passed from one person to another. Books, newspapers, maps, letters, and drawings: all types of documents can be scanned and processed in order to become available in a digital format. Digital storage takes very little space compared to physical documents, so these applications can replace millions of old paper volumes with a single disk that is accessible to anyone with Internet access and is not at risk of deterioration. Other problems, such as ecological issues and accessibility and flexibility constraints, can also be addressed by document image analysis systems. This article presents the methods and techniques used to process paper documents and convert them to electronic ones, starting from the pixel level and working up to the level of the entire document. The main purpose of Document Image Analysis Systems is to recognize text and graphics in images and to extract, format, and present the information they contain according to people's needs. We also try to provide solid ground for practitioners who implement such systems to enhance their unsupervised processing features, in order to make physical documents easily available to the masses.
International Journal of New Computer Architectures and Their Applications, 2011
2021
The massive production of documents in Portable Document Format (PDF) has motivated research on automated extraction of the data contained in these files. This work focuses mainly on extraction from natively digital PDF documents made available in large repositories of educational exams. For this purpose, the educational tests applied at Enade were used, collected automatically with scripts developed with Scrapy. The files used for the evaluation comprise 343 tests, with 11,196 objective and discursive questions, and 396 answers, with 14,475 alternatives extracted from the objective questions. The Aletheia tool was used to construct the ground truth for the tests. For the extractions, existing tools that extract data from PDF files were used: tabular data extraction, with Excalibur and Tabula for answer extraction; textual content extraction, with CyberPDF and PDFMiner to extract the questions; and extraction of regions of interest, with Aletheia and ExamClipper f...
International Journal of Engineering Research and, 2020
The cornerstone of this project is to create a DAR system whose functionality can be applied in research and commercial systems. The system will allow the user to convert printed text into an editable digital format, making recognition easier and more efficient. The system will rely on well-known methods, such as image processing, pattern recognition, and artificial intelligence, to which will be added a unique technique, namely natural image processing. Effective use of the CNN algorithm would help reduce image blur while effectively recognizing text. The entire system will be a three-step process of creating a digital document from a hard copy, scanning the digital document, and recognizing templates in the document for correct, error-free results.