Academic literature on the topic 'Visual attention, artificial intelligence, machine learning, computer vision'
Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles
Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Visual attention, artificial intelligence, machine learning, computer vision.'
Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.
Journal articles on the topic "Visual attention, artificial intelligence, machine learning, computer vision"
Wan, Yijie, and Mengqi Ren. "New Visual Expression of Anime Film Based on Artificial Intelligence and Machine Learning Technology." Journal of Sensors 2021 (June 26, 2021): 1–10. http://dx.doi.org/10.1155/2021/9945187.
Anh, Dao Nam. "Interestingness Improvement of Face Images by Learning Visual Saliency." Journal of Advanced Computational Intelligence and Intelligent Informatics 24, no. 5 (September 20, 2020): 630–37. http://dx.doi.org/10.20965/jaciii.2020.p0630.
V., Dr Suma. "Computer Vision for Human-Machine Interaction: Review." Journal of Trends in Computer Science and Smart Technology 2019, no. 02 (December 29, 2019): 131–39. http://dx.doi.org/10.36548/jtcsst.2019.2.006.
Prijs, Jasper, Zhibin Liao, Soheil Ashkani-Esfahani, Jakub Olczak, Max Gordon, Prakash Jayakumar, Paul C. Jutte, Ruurd L. Jaarsma, Frank F. A. IJpma, and Job N. Doornberg. "Artificial intelligence and computer vision in orthopaedic trauma." Bone & Joint Journal 104-B, no. 8 (August 1, 2022): 911–14. http://dx.doi.org/10.1302/0301-620x.104b8.bjj-2022-0119.r1.
Liu, Yang, Anbu Huang, Yun Luo, He Huang, Youzhi Liu, Yuanyuan Chen, Lican Feng, Tianjian Chen, Han Yu, and Qiang Yang. "Federated Learning-Powered Visual Object Detection for Safety Monitoring." AI Magazine 42, no. 2 (October 20, 2021): 19–27. http://dx.doi.org/10.1609/aimag.v42i2.15095.
L, Anusha, and Nagaraja G. S. "Outlier Detection in High Dimensional Data." International Journal of Engineering and Advanced Technology 10, no. 5 (June 30, 2021): 128–30. http://dx.doi.org/10.35940/ijeat.e2675.0610521.
Li, Jing, and Guangren Zhou. "Visual Information Features and Machine Learning for Wushu Arts Tracking." Journal of Healthcare Engineering 2021 (August 4, 2021): 1–6. http://dx.doi.org/10.1155/2021/6713062.
Mogadala, Aditya, Marimuthu Kalimuthu, and Dietrich Klakow. "Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods." Journal of Artificial Intelligence Research 71 (August 30, 2021): 1183–317. http://dx.doi.org/10.1613/jair.1.11688.
Zhou, Zhiyu, Jiangfei Ji, Yaming Wang, Zefei Zhu, and Ji Chen. "Hybrid regression model via multivariate adaptive regression spline and online sequential extreme learning machine and its application in vision servo system." International Journal of Advanced Robotic Systems 19, no. 3 (May 1, 2022): 172988062211086. http://dx.doi.org/10.1177/17298806221108603.
Liu, Yang, Anbu Huang, Yun Luo, He Huang, Youzhi Liu, Yuanyuan Chen, Lican Feng, Tianjian Chen, Han Yu, and Qiang Yang. "FedVision: An Online Visual Object Detection Platform Powered by Federated Learning." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 08 (April 3, 2020): 13172–79. http://dx.doi.org/10.1609/aaai.v34i08.7021.
Dissertations / Theses on the topic "Visual attention, artificial intelligence, machine learning, computer vision"
Mahendru, Aroma. "Role of Premises in Visual Question Answering." Thesis, Virginia Tech, 2017. http://hdl.handle.net/10919/78030.
Bui, Anh Duc. "Visual Scene Understanding through Scene Graph Generation and Joint Learning." Thesis, University of Sydney, 2023. https://hdl.handle.net/2123/29954.
Full textRochford, Matthew. "Visual Speech Recognition Using a 3D Convolutional Neural Network." DigitalCommons@CalPoly, 2019. https://digitalcommons.calpoly.edu/theses/2109.
Full textSalem, Tawfiq. "Learning to Map the Visual and Auditory World." UKnowledge, 2019. https://uknowledge.uky.edu/cs_etds/86.
Full textAzizpour, Hossein. "Visual Representations and Models: From Latent SVM to Deep Learning." Doctoral thesis, KTH, Datorseende och robotik, CVAP, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-192289.
Full textQC 20160908
Warnakulasuriya, Tharindu R. "Context modelling for single and multi agent trajectory prediction." Thesis, Queensland University of Technology, 2019. https://eprints.qut.edu.au/128480/1/Tharindu_Warnakulasuriya_Thesis.pdf.
Full textHernández-Vela, Antonio. "From pixels to gestures: learning visual representations for human analysis in color and depth data sequences." Doctoral thesis, Universitat de Barcelona, 2015. http://hdl.handle.net/10803/292488.
Full textL’anàlisi visual de persones a partir d'imatges és un tema de recerca molt important, atesa la rellevància que té a una gran quantitat d'aplicacions dins la visió per computador, com per exemple: detecció de vianants, monitorització i vigilància,interacció persona-màquina, “e-salut” o sistemes de recuperació d’matges a partir de contingut, entre d'altres. En aquesta tesi volem aprendre diferents representacions visuals del cos humà, que siguin útils per a la anàlisi visual de persones en imatges i vídeos. Per a tal efecte, analitzem diferents modalitats d'imatge com són les imatges de color RGB i les imatges de profunditat, i adrecem el problema a diferents nivells d'abstracció, des dels píxels fins als gestos: segmentació de persones, estimació de la pose humana i reconeixement de gestos. Primer, mostrem com la segmentació binària (objecte vs. fons) del cos humà en seqüències d'imatges ajuda a eliminar soroll pertanyent al fons de l'escena en qüestió. El mètode presentat, basat en optimització “Graph cuts”, imposa consistència espai-temporal a Ies màscares de segmentació obtingudes en “frames” consecutius. En segon lloc, presentem un marc metodològic per a la segmentació multi-classe, amb la qual podem obtenir una descripció més detallada del cos humà, en comptes d'obtenir una simple representació binària separant el cos humà del fons, podem obtenir màscares de segmentació més detallades, separant i categoritzant les diferents parts del cos. A un nivell d'abstraccíó més alt, tenim com a objectiu obtenir representacions del cos humà més simples, tot i ésser suficientment descriptives. Els mètodes d'estimació de la pose humana sovint es basen en models esqueletals del cos humà, formats per segments (o rectangles) que representen les extremitats del cos, connectades unes amb altres seguint les restriccions cinemàtiques del cos humà. A la pràctica, aquests models esqueletals han de complir certes restriccions per tal de poder aplicar mètodes d'inferència que permeten trobar la solució òptima de forma eficient, però a la vegada aquestes restriccions suposen una gran limitació en l'expressivitat que aques.ts models son capaços de capturar. Per tal de fer front a aquest problema, proposem un enfoc “top-down” per a predir la posició de les parts del cos del model esqueletal, introduïnt una representació de parts de mig nivell basada en “Poselets”. Finalment. proposem un marc metodològic per al reconeixement de gestos, basat en els “bag of visual words”. Aprofitem els avantatges de les imatges RGB i les imatges; de profunditat combinant vocabularis visuals específiques per a cada modalitat, emprant late fusion. Proposem un nou descriptor per a imatges de profunditat invariant a rotació, que millora l'estat de l'art, i fem servir piràmides espai-temporals per capturar certa estructura espaial i temporal dels gestos. Addicionalment, presentem una reformulació probabilística del mètode “Dynamic Time Warping” per al reconeixement de gestos en seqüències d'imatges. Més específicament, modelem els gestos amb un model probabilistic gaussià que implícitament codifica possibles deformacions tant en el domini espaial com en el temporal.
Novotný, Václav. "Rozpoznání displeje embedded zařízení [Recognition of an embedded device display]." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2018. http://www.nusl.cz/ntk/nusl-376924.
Altieri, Alex. "Yacht experience, ricerca e sviluppo di soluzioni basate su intelligenza artificiale per il comfort e la sicurezza in alto mare [Yacht experience: research and development of artificial intelligence-based solutions for comfort and safety on the high seas]." Doctoral thesis, Università Politecnica delle Marche, 2021. http://hdl.handle.net/11566/287605.
Full textThe thesis describes the results of the research and development of new technologies based on artificial intelligence techniques, able to achieve an empathic interaction and an emotional connection between man and "the machines" in order to improve comfort and safety on board of yachts. This interaction is achieved through the recognition of emotions and behaviors and the following activation of all those multimedia devices available in the environment on board, which are adapted to the mood of the subject inside the room. The prototype system developed during the three years of PhD is now able to manage multimedia content (e.g. music tracks, videos played on LED screens) and light scenarios, based on the user's emotion, recognized by facial expressions taken from any camera installed inside the space. In order to make the interaction adaptive, the developed system implements Deep Learning algorithms to recognize the identity of the users on board (Facial Recognition), the degree of attention of the commander (Gaze Detection and Drowsiness) and the objects with which he interacts (phone, earphones, etc.). This information is processed within the system to identify any situations of risk to the safety of people on board and to monitor the entire environment. The application of these technologies, in this domain that is always open to the introduction of the latest innovations on board, opens up several research challenges.
Zanca, Dario. "Towards laws of visual attention." Doctoral thesis, 2019. http://hdl.handle.net/2158/1159344.
Books on the topic "Visual attention, artificial intelligence, machine learning, computer vision"
Rui, Yong, and Thomas S. Huang, eds. Exploration of Visual Data. Boston: Kluwer Academic Publishers, 2003.
Visual Saliency Computation: A Machine Learning Perspective. Springer, 2014.
Gao, Wen, and Jia Li. Visual Saliency Computation: A Machine Learning Perspective. Springer London, Limited, 2014.
Huang, Thomas S. Exploration of Visual Data. Springer My Copy UK, 2003.
Huang, Thomas S., Yong Rui, and Sean Xiang Zhou. Exploration of Visual Data (The International Series in Video Computing). Springer, 2003.
Book chapters on the topic "Visual attention, artificial intelligence, machine learning, computer vision"
Madani, Kurosh. "Robots’ Vision Humanization Through Machine-Learning Based Artificial Visual Attention." In Communications in Computer and Information Science, 8–19. Cham: Springer International Publishing, 2019. http://dx.doi.org/10.1007/978-3-030-35430-5_2.
Whitworth, Brian, and Hokyoung Ryu. "A Comparison of Human and Computer Information Processing." In Machine Learning, 1–12. IGI Global, 2012. http://dx.doi.org/10.4018/978-1-60960-818-7.ch101.
Khare, Neelu, Brijendra Singh, and Munis Ahmed Rizvi. "Deep Learning Methods for Modelling Emotional Intelligence." In Multidisciplinary Applications of Deep Learning-Based Artificial Emotional Intelligence, 234–54. IGI Global, 2022. http://dx.doi.org/10.4018/978-1-6684-5673-6.ch015.
Cao, Yushi, Yon Shin Teo, Yan Zheng, Yuxuan Toh, and Shang-Wei Lin. "A Holistic Automated Software Structure Exploration Framework for Testing." In Frontiers in Artificial Intelligence and Applications. IOS Press, 2022. http://dx.doi.org/10.3233/faia220259.
Whittlestone, Jess. "AI and Decision-Making." In Future Morality, 102–10. Oxford University Press, 2021. http://dx.doi.org/10.1093/oso/9780198862086.003.0010.
Conference papers on the topic "Visual attention, artificial intelligence, machine learning, computer vision"
Jiao, Zhicheng, Haoxuan You, Fan Yang, Xin Li, Han Zhang, and Dinggang Shen. "Decoding EEG by Visual-guided Deep Neural Networks." In Twenty-Eighth International Joint Conference on Artificial Intelligence {IJCAI-19}. California: International Joint Conferences on Artificial Intelligence Organization, 2019. http://dx.doi.org/10.24963/ijcai.2019/192.
Zhang, Licheng, Xianzhi Wang, Lina Yao, Lin Wu, and Feng Zheng. "Zero-Shot Object Detection via Learning an Embedding from Semantic Space to Visual Space." In Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence {IJCAI-PRICAI-20}. California: International Joint Conferences on Artificial Intelligence Organization, 2020. http://dx.doi.org/10.24963/ijcai.2020/126.
Wang, Yaxiong, Hao Yang, Xueming Qian, Lin Ma, Jing Lu, Biao Li, and Xin Fan. "Position Focused Attention Network for Image-Text Matching." In Twenty-Eighth International Joint Conference on Artificial Intelligence {IJCAI-19}. California: International Joint Conferences on Artificial Intelligence Organization, 2019. http://dx.doi.org/10.24963/ijcai.2019/526.
Full textVenkata Sai Saran Naraharisetti, Sree Veera, Benjamin Greenfield, Benjamin Placzek, Steven Atilho, Mohamad Nassar, and Mehdi Mekni. "A Novel Intelligent Image-Processing Parking Systems." In 3rd International Conference on Artificial Intelligence and Machine Learning (CAIML 2022). Academy and Industry Research Collaboration Center (AIRCC), 2022. http://dx.doi.org/10.5121/csit.2022.121212.
Diao, Xiaolei. "Building a Visual Semantics Aware Object Hierarchy." In Thirty-First International Joint Conference on Artificial Intelligence {IJCAI-22}. California: International Joint Conferences on Artificial Intelligence Organization, 2022. http://dx.doi.org/10.24963/ijcai.2022/826.
Lin, Jianxin, Yingce Xia, Yijun Wang, Tao Qin, and Zhibo Chen. "Image-to-Image Translation with Multi-Path Consistency Regularization." In Twenty-Eighth International Joint Conference on Artificial Intelligence {IJCAI-19}. California: International Joint Conferences on Artificial Intelligence Organization, 2019. http://dx.doi.org/10.24963/ijcai.2019/413.