TRANS-VQA: Fully Transformer-Based Image Question-Answering Model Using Question-guided Vision Attention

<?xml version="1.0" encoding="UTF-8"?><modsCollection xmlns="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-8.xsd">
<mods version="3.8">
<titleInfo>
<title>TRANS-VQA</title>
<subTitle>Fully Transformer-Based Image Question-Answering Model Using Question-guided Vision Attention</subTitle>
</titleInfo>
<name type="personal" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="MAPA20240020842">
<namePart>Koshti, Dipali</namePart>
<nameIdentifier>MAPA20240020842</nameIdentifier>
</name>
<typeOfResource>text</typeOfResource>
<genre authority="marcgt">periodical</genre>
<originInfo>
<place>
<placeTerm type="code" authority="marccountry">esp</placeTerm>
</place>
<dateIssued encoding="marc">2024</dateIssued>
<issuance>serial</issuance>
</originInfo>
<language>
<languageTerm type="code" authority="iso639-2b">eng</languageTerm>
</language>
<physicalDescription>
<form authority="marcform">print</form>
</physicalDescription>
<abstract displayLabel="Summary">Understanding multiple modalities and relating them is an easy task for humans, but for machines it is a challenging one. One such multi-modal reasoning task is visual question answering (VQA), which demands that the machine produce an answer to a natural-language query about a given image. Although plenty of work has been done in this field, improving the model's answer-prediction ability and surpassing human accuracy remain open challenges. A novel transformer-based model for answering image-based questions is proposed. The proposed model is a fully transformer-based architecture that exploits the power of a transformer both for extracting language features and for performing joint understanding of question and image features. The proposed VQA model uses F-RCNN for image feature extraction.</abstract>
<note type="statement of responsibility">Dipali Koshti [et al.]</note>
<subject xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="MAPA20080611200">
<topic>Inteligencia artificial</topic>
</subject>
<subject xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="MAPA20080541408">
<topic>Imagen</topic>
</subject>
<subject xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="MAPA20080548056">
<topic>Máquinas</topic>
</subject>
<subject xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="MAPA20080617479">
<topic>Lenguajes de programación</topic>
</subject>
<classification authority="">922.134</classification>
<location>
<url displayLabel="electronic resource" usage="primary display">https://journal.iberamia.org/index.php/intartif/article/view/1252</url>
</location>
<relatedItem type="host">
<titleInfo>
<title>Revista Iberoamericana de Inteligencia Artificial</title>
</titleInfo>
<originInfo>
<publisher>IBERAMIA, Sociedad Iberoamericana de Inteligencia Artificial, 2018-</publisher>
</originInfo>
<identifier type="issn">1988-3064</identifier>
<identifier type="local">MAP20200034445</identifier>
<part>
<text>19/06/2024 Volumen 27 Número 73 - junio 2024, p. 11-128</text>
</part>
</relatedItem>
<recordInfo>
<recordContentSource authority="marcorg">MAP</recordContentSource>
<recordCreationDate encoding="marc">240829</recordCreationDate>
<recordChangeDate encoding="iso8601">20240829104256.0</recordChangeDate>
<recordIdentifier source="MAP">MAP20240013271</recordIdentifier>
<languageOfCataloging>
<languageTerm type="code" authority="iso639-2b">spa</languageTerm>
</languageOfCataloging>
</recordInfo>
</mods>
</modsCollection>
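
The abstract above outlines the TRANS-VQA architecture at a high level: a transformer extracts language features from the question, F-RCNN (presumably Faster R-CNN) extracts region features from the image, and a question-guided vision attention mechanism fuses the two before answer prediction. Below is a minimal, illustrative PyTorch sketch of that general idea, not the authors' actual implementation; all module names, dimensions (768-d question tokens, 2048-d region features, a 3,129-way answer vocabulary), and the additive attention scoring are assumptions made for the example.

import torch
import torch.nn as nn

class QuestionGuidedVisionAttention(nn.Module):
    """Hypothetical sketch: transformer question encoder plus question-guided
    attention over precomputed F-RCNN region features (not the paper's code)."""

    def __init__(self, q_dim=768, v_dim=2048, hidden=512, n_answers=3129):
        super().__init__()
        # Transformer encoder over question token embeddings (language features).
        layer = nn.TransformerEncoderLayer(d_model=q_dim, nhead=8, batch_first=True)
        self.q_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.q_proj = nn.Linear(q_dim, hidden)
        self.v_proj = nn.Linear(v_dim, hidden)
        self.score = nn.Linear(hidden, 1)  # one attention score per image region
        self.classifier = nn.Sequential(
            nn.Linear(hidden * 2, hidden), nn.ReLU(), nn.Linear(hidden, n_answers)
        )

    def forward(self, q_tokens, regions):
        # q_tokens: (B, T, q_dim) question token embeddings
        # regions:  (B, R, v_dim) precomputed F-RCNN region features
        q = self.q_encoder(q_tokens).mean(dim=1)   # (B, q_dim) pooled question
        qh = self.q_proj(q)                        # (B, hidden)
        vh = self.v_proj(regions)                  # (B, R, hidden)
        # Question-guided attention: weight each region by its relevance
        # to the question, then pool regions into a single image vector.
        scores = self.score(torch.tanh(vh + qh.unsqueeze(1)))  # (B, R, 1)
        weights = torch.softmax(scores, dim=1)
        v_att = (weights * vh).sum(dim=1)          # (B, hidden)
        # Joint question-image representation feeds the answer classifier.
        return self.classifier(torch.cat([qh, v_att], dim=-1))

# Usage with random tensors standing in for real features:
model = QuestionGuidedVisionAttention()
logits = model(torch.randn(2, 14, 768), torch.randn(2, 36, 2048))
print(logits.shape)  # torch.Size([2, 3129])

The additive (tanh-based) scoring above is one common way to guide attention with a query vector; a "fully transformer-based" fusion as described in the abstract would instead use transformer cross-attention layers, which this sketch deliberately simplifies.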